Python for Data Science
import pandas as pd
import numpy as np
import seaborn as sns
%config Completer.use_jedi = False # faster auto complete
#plot
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
# set style
plt.style.use('classic')
%matplotlib inline
#get help on python command
print(help(len))
len?
''' or press [Shift]+[Tab] '''
#also works on methods
A = [1,2,3]
A.insert?
#accessing source code
sum??
#Wild card search for object
Small_list = [1,2,3,4]
Big_list = [1,2,3,4,5,6,7,8,9]
#find every object whose name contains 'lis'
*lis*?
#download dataset from url
# !curl -O https://xxxxxx/chipotle.tsv
!curl -O https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv
# chipo = pd.read_csv('chipotle.tsv', delimiter='\t') #delimiter (\t for tab) and (, for comma)
chipo = pd.read_csv('chipotle.tsv',delimiter='\t')
Let's think of all data fundamentally as arrays of numbers.
For example, digital images can be thought of as simply two-dimensional arrays of numbers representing pixel brightness across the area.
Sound clips can be thought of as one-dimensional arrays of intensity versus time.
Text can be converted in various ways into numerical representations, perhaps binary digits representing the frequency of certain words or pairs of words.
No matter what the data are, the first step in making them analyzable is to transform them into arrays of numbers.
NumPy (short for Numerical Python) provides an efficient interface to store and operate on dense data buffers. In some ways, NumPy arrays are like Python’s built-in list type, but NumPy arrays provide much more efficient storage and data operations as the arrays grow larger in size
import numpy as np
np.__version__
Python lists can store elements of different types; NumPy arrays are fixed-type, and giving up that flexibility makes them much more efficient for storage and computation.
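A quick way to see the fixed-type behavior: NumPy silently upcasts mixed numeric input to a single dtype (a minimal sketch with made-up values):

```python
import numpy as np

# a NumPy array is fixed-type: mixed numeric input is upcast to one dtype
arr = np.array([1, 2, 3.5])
print(arr.dtype)  # float64: the integers were upcast to floats
```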
#Create from scratch
y = np.array([4,3,2],dtype='float32') #can add type to np.array (optional)
#Create from list
mylist = [1,2,3]
x= np.array(mylist)
print('y=',y)
print('x=',x)
# matrix fill with one
z = np.ones((3,5)) #create 3*5 matrix of 1
print('z=',z)
#range by .arange(start,stop,step)
m = np.arange(0,20,2)
print('m=',m)
#create array of random int
np.random.randint(5,10,(3,2)) # random integers from 5 to 9 (upper bound exclusive) in a 3*2 matrix
| Data type | Description |
|---|---|
| bool_ | Boolean (True or False) stored as a byte |
| int_ | Default integer type (same as C long; normally either int64 or int32) |
| intc | Identical to C int (normally int32 or int64) |
| intp | Integer used for indexing (same as C ssize_t; normally either int32 or int64) |
| int8 | Byte (–128 to 127) |
| int16 | Integer (–32768 to 32767) |
| int32 | Integer (–2147483648 to 2147483647) |
| int64 | Integer (–9223372036854775808 to 9223372036854775807) |
| uint8 | Unsigned integer (0 to 255) |
| uint16 | Unsigned integer (0 to 65535) |
| uint32 | Unsigned integer (0 to 4294967295) |
| uint64 | Unsigned integer (0 to 18446744073709551615) |
| float_ | Shorthand for float64 |
| float16 | Half-precision float: sign bit, 5 bits exponent, 10 bits mantissa |
| float32 | Single-precision float: sign bit, 8 bits exponent, 23 bits mantissa |
| float64 | Double-precision float: sign bit, 11 bits exponent, 52 bits mantissa |
| complex_ | Shorthand for complex128 |
| complex64 | Complex number, represented by two 32-bit floats |
| complex128 | Complex number, represented by two 64-bit floats |
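The ranges in the table can be checked programmatically: np.iinfo and np.finfo report the limits of integer and float dtypes.

```python
import numpy as np

# np.iinfo / np.finfo report the limits of each dtype, matching the table above
print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)  # -128 127
print(np.iinfo(np.uint16).max)                       # 65535
print(np.finfo(np.float64).bits)                     # 64
```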
np.random.seed(0) #set seed
s = np.random.randint(1,10,(3,3))
s #create 3*3 matrix of random array
# Each array has attributes
print("s ndim: ", s.ndim,'where ndim is the number of dimensions, 2 in this case') #access attributes with array.attribute
print("s shape: ", s.shape,' shape is the 3*3 matrix dimensions')
print("s size: ", s.size,' size is the total number of elements (the product of the shape)')
#access element in matrix
s[2,1] # get the third row and second column (zero-indexing)
#modify last row to 1,2,3
s[2,:] = [1,2,3] ;s
# x[start:stop:step]
k = np.random.randint(1,20,20)
print(k)
print(k[10::2]) # from index 10 to the end, stepping by 2
print(k[::-1]) # reverse all element
#reverse matrix
s[::-1,::-1]
NumPy slicing normally returns a view, so a copy needs to be made explicitly:
s_copy = s[1:,:2].copy()
s_copy
#reshape to a 1*9 matrix
print("original shape is",s.shape)
print('new shape is ' ,s.reshape(1,9).shape)
print(np.array([1,2,5,3]))
print('to ')
print(np.array([1,2,5,3]).reshape(4,1))
Arrays can be joined using np.concatenate, np.vstack, and np.hstack.
x = np.array([1,2,3,4,5])
y = np.array([6,7,8,9,10])
print(np.concatenate([x,y])) # arrays must have the same number of dimensions
print(np.concatenate([s,s]))
print('concat along second axis')
print(np.concatenate([s,s],axis= 1)) # along the second axis (zero-index)
# Different dimension use np.vstack(same no. of columns) and np.hstack(same no. of rows)
x = np.array([1,2,3])
y = np.array([[6,7,8],
[4,5,6]])
print(np.vstack([x,y])) #"vertical concat"
print(np.hstack([y,y])) #"horizontal concat"
Arrays can be split using np.split, np.hsplit, and np.vsplit.
# hsplit = split column-wise
np.hsplit(y,3)[0] # split y into 3 columns and return the first one
# split = row wise
x = [1, 2, 3, 99, 99, 3, 2, 1]
x1, x2, x3 = np.split(x, [3, 5]) # x1 = 0:3 (inclusive : exclusive) and x2 = 3:5 and x3 = 5: end
print(x1, x2, x3)
Because lists are flexible enough to store various types in Python, they are also slow for computation (iteration).
NumPy arrays are a lot faster when we compute with vectorized operations.
# vectorized operation
print(y)
print('top is y and below is y+4')
print(y+4)
k = np.arange(1,11).reshape(2,5)
2**k #vectorized operation on multi-dimension
#Exponents and logarithms
x = [1, 2, 3]
print("x =", x)
print("e^x =", np.exp(x))
print("2^x =", np.exp2(x))
print("3^x =", np.power(3, x))
x = [1, 2, 4, 10]
print("x =", x)
print("ln(x) =", np.log(x))
print("log2(x) =", np.log2(x))
print("log10(x) =", np.log10(x))
# can specify where to store output, given the output array is already allocated
x = np.arange(1,6) # array of 1 to 5
x_2 = np.empty(5)
np.multiply(x,3,out = x_2) # multiply every element by 3 and store the result in x_2
x_2
# reduce() = repeatedly apply the operation until we have one result left and return it
m = np.arange(1,6)
np.add.reduce(m)
# sum all have same result
np.arange(1,6).sum()
# Accumulate // cumulative
print(np.add.accumulate(m))
print(np.multiply.accumulate(m))
# Max, Min
print('max value in x is',np.max(x),'or',x.max()) # the NumPy way is much faster than the Python built-ins max(x) or min(x)
# on multi-dimensional arrays
k = np.ones(12,dtype= 'int32').reshape(3,4) ; print(k)
print(k.sum()) # return sum of entire array
print(k.sum(axis = 0)) #sum down each column (axis=0 collapses the rows)
print(k.sum(axis = 1)) #sum across each row (axis=1 collapses the columns)
Note: aggregates take an axis parameter for row- or column-wise operation, e.g. np.sum(x,axis=0) sums the array column-wise.
| Function Name | NaN-safe Version | Description |
|---|---|---|
| np.sum | np.nansum | Compute sum of elements |
| np.prod | np.nanprod | Compute product of elements |
| np.mean | np.nanmean | Compute mean of elements |
| np.std | np.nanstd | Compute standard deviation |
| np.var | np.nanvar | Compute variance |
| np.min | np.nanmin | Find minimum value |
| np.max | np.nanmax | Find maximum value |
| np.argmin | np.nanargmin | Find index of minimum value |
| np.argmax | np.nanargmax | Find index of maximum value |
| np.median | np.nanmedian | Compute median of elements |
| np.percentile | np.nanpercentile | Compute rank-based statistics of elements |
| np.any | N/A | Evaluate whether any elements are true |
| np.all | N/A | Evaluate whether all elements are true |
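A small sketch of why the NaN-safe versions matter: a single NaN poisons the ordinary aggregates, while the nan-prefixed versions ignore it (illustrative values):

```python
import numpy as np

v = np.array([1.0, 2.0, np.nan, 4.0])
print(np.sum(v))     # nan: one NaN poisons the plain aggregate
print(np.nansum(v))  # 7.0: the NaN-safe version ignores missing values
print(np.nanmax(v))  # 4.0
```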
# Case example
import pandas as pd
data = pd.read_csv('data/president_heights.csv',index_col =0) # path is relative to where this notebook is stored
data.head()
# find summary statistic
heights = data['height(cm)']
print("Mean height: ", heights.mean())
print("Standard deviation:", heights.std())
print("Minimum height: ", heights.min())
print("Maximum height: ", heights.max())
print("25th percentile: ", np.percentile(heights, 25))
print("Median: ", np.median(heights))
print("75th percentile: ", np.percentile(heights, 75))
Broadcasting is simply a set of rules for applying binary ufuncs (addition, subtraction, multiplication, etc.) on arrays of different sizes.
a = np.ones((3,3))
b = np.random.randint(1,20,3)
print(a,'has shape of ',a.shape)
print(b,'has shape of ',b.shape)
# a can combine with b by broadcasting (stretching) b from a 1*3 to a 3*3 matrix so the shapes match
a+b
# both can be broadcast at the same time
a =np.arange(3)
b =np.arange(3).reshape(3,1)
print(a)
print(b)
a+b
#case example using broadcasting
x = np.random.randint(10,20,[10,3])
print(x)
x_mean = x.mean(axis = 0) ; x_mean
# we can find deviations by broadcasting x_mean from a 1*3 to a 10*3 matrix
deviation = x - x_mean
deviation.mean(0)
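Broadcasting can also fail: trailing dimensions must either match or be 1. A minimal sketch with made-up shapes:

```python
import numpy as np

a = np.ones((3, 2))
b = np.arange(2)      # shape (2,): stretched along the missing axis to (3, 2)
print((a + b).shape)  # (3, 2)

c = np.arange(3)      # shape (3,) is incompatible with (3, 2)
try:
    a + c
except ValueError as e:
    print('broadcast failed:', e)
```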
Masking comes up when you want to extract, modify, count, or otherwise manipulate values in an array based on some criterion: for example, you might wish to count all values greater than a certain value, or perhaps remove all outliers that are above some threshold.
x = np.arange(1,11) #array of 1 -10
print(x)
print(x<3)
print(x>=3)
print(x != 3)
# element-by-element comparison of two arrays
(2*x) == (x**2)
# Count all entries (different way of counting)
y = np.arange(1,7).reshape(3,2)
print(y)
print(y.size)
print(np.count_nonzero(y))
# count entries where y<3
print(np.count_nonzero(y<3))
print(np.sum(y<3)) #in this case, False is interpreted as 0, and True is interpreted as 1
print(np.sum(y<3,axis= 0)) # aggregation functions can define axis for row or column wise operation
# check any and all values
print(np.any(y>6)) # are there any value in y that is more than 6
np.all(y<10) # Are all value in y less than 10
| Operator | Equivalent ufunc | Description |
|---|---|---|
| & | np.bitwise_and | and |
| | | np.bitwise_or | or |
| ^ | np.bitwise_xor | bit wise exclusive or |
| ~ | np.bitwise_not | not |
*note xor rules, examined bitwise: 0^0 = 0, 1^1 = 0, 0^1 = 1, and 1^0 = 1
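A short demo of combining masks with these operators on an example array:

```python
import numpy as np

x = np.arange(10)
print(np.sum((x > 3) & (x < 8)))  # 4 -> values 4,5,6,7
print(np.sum((x < 3) | (x > 7)))  # 5 -> values 0,1,2,8,9
print(np.sum((x < 5) ^ (x < 8)))  # 3 -> exactly one condition true: 5,6,7
print(np.sum(~(x < 5)))           # 5 -> values 5..9
```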
# Load case example from Seattle file // do summary statistic
data = pd.read_csv('data/Seattle2014.csv')['PRCP'].values
inches = data/254
inches.shape
# combine conditions element-wise with & (Python's 'and' doesn't work on arrays)
np.sum((inches<1) & (inches>0.5))
print("Number days without rain: ", np.sum(inches==0) )
print("Number days with rain: ", np.sum(inches>0))
print("Days with more than 0.5 inches:", np.sum(inches>0.5) )
print("Rainy days with < 0.1 inches :",np.sum((inches>0) & (inches<0.1)) )
Boolean arrays can be used as masks to select particular subsets of the data themselves.
x = np.random.randint(1,10,12).reshape(3,4)
print(x)
#obtain the subset of x where values are less than 5
#step 1 boolean mask
print(x<5)
#step 2 return every value that index is True (which mean less than 5)
x[x<5] # returned a one-dimensional array with all values that meet condition; in other words, all the values in positions at which the mask array is True.
rainy_data = inches[inches>0] # return all value that is more than 0
# construct a mask of all summer days (June 21st is the 172nd day)
summer_day = (np.arange(365) - 172 <90) & (np.arange(365) - 172 >0) # day-of-year minus 172: summer is the 90 days after June 21st (the 172nd day)
print("Median precip on rainy days in 2014 (inches): ", np.median(inches[inches>0]))
print("Median precip on summer days in 2014 (inches): ",np.median(inches[summer_day]))
print("Maximum precip on summer days in 2014 (inches): ",np.max(inches[summer_day]))
print("Median precip on non-summer and rainy days (inches):",np.median(inches[(inches>0) & ~summer_day]))
r = np.random.RandomState(42) # set seed to this variable
x = r.randint(100, size = 10)
x
# normally we access value by index
print(x[[0,1,2,3]] )# return value in index 0,1,2,3
#same as
ind = [0,1,2,3]
x[ind]
#with fancy indexing, the result copies the shape of the index array
ind2 = np.array([[0,1],
[2,3]])
x[ind2]
#another example
y = np.arange(12).reshape(3,4)
print(y)
y[[0,1,2],[2,1,3]] # returns the array (row0,col2), (row1,col1), (row2,col3) by pairing rows with columns
# Slicing with fancy index
y[1:,[3,1,2]] #return everything from row 1 that is column 3 ,1 and 2 respectively
x = np.arange(10)
i = np.array([2,1,8,4])
x[i] = 99
print(x)
x[i] -= 10
print(x)
np.sort ~ O(N log N), much faster than sorting a list with Python's built-in sorted()
np.argsort ~ returns the indices of the sorted elements
note: for a pd.DataFrame use $.sort\_values()$
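For completeness, a minimal sort_values() sketch on a made-up DataFrame:

```python
import pandas as pd

# illustrative frame (made-up values)
scores = pd.DataFrame({'name': ['b', 'a', 'c'], 'score': [2, 3, 1]})
print(scores.sort_values('score'))                   # ascending by one column
print(scores.sort_values('score', ascending=False))  # descending
```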
np.sort(x) # not inplace
x.sort() #sort inplace
x = np.array([2, 1, 4, 3, 5])
i = np.argsort(x) # return the index of sorting array
x[i] # which can then be used to sort via fancy indexing
x = np.random.randint(1,100,20).reshape(5,4)
x
np.sort(x,axis = 0) # axis=0 sorts each column independently
$np.partition()$ takes an array and a number K; the result is a new array with the smallest K values to the left of the partition, and the remaining values to the right, in arbitrary order.
$np.argpartition$ ~ returns the indices instead
x = np.array([4,521,21,9,95,77,2])
# puts the 'k' smallest numbers leftmost; the remaining numbers follow in arbitrary order
np.partition(x,2) # the two smallest values, 2 and 4, end up leftmost; the rest are in arbitrary order
data = np.zeros(4)
data = np.zeros(4, dtype={'names':('name', 'age', 'weight'),
'formats':('U10', 'i4', 'f8')})
data
name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]
data['name'] = name
data['age'] = age
data['weight'] = weight
print(data)
# Get all names
print(data['name'])
# Get names where age is under 30
print(data[data['age'] < 30]['name'])
This section focuses on the mechanics of using Series, DataFrame, and related structures effectively.
import pandas
pandas.__version__
There are three fundamental Pandas data structures: the Series, DataFrame, and Index.
A Pandas Series is a one-dimensional array of indexed data.
The essential difference from np.array is the presence of the index: while the NumPy array has an implicitly defined integer index used to access the values, the Pandas Series has an explicitly defined index associated with the values. (The index can be defined manually as anything: floats, strings, etc.)
$pd.Series(data, index=index)$
where index is an optional argument, and data can be one of many entities.
p = pd.Series([100,200,300,400])
print(p)
p.values # .values is simply a np.array
# index can be anything ,also string
p = pd.Series([100,200,300.5,400],
index = ['a','b','c','55'])
p
# construct from python dictionary
p_dict = {'California': 38332525,
'Texas': 26448193,
'New York': 19651127,
'Florida': 19552860,
'Illinois': 12882135}
population = pd.Series(p_dict)
population
# a scalar is repeated to fill every index entry
pd.Series('Copy me', index=[1, 2, 3])
DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names.
#create a DataFrame by combining pd.Series objects inside a dictionary
area = {'California': 55,
'Texas': 56,
'New York': 54,
'Florida': 89,
'Illinois': 123}
dat = pd.DataFrame({'area' : area,
'pop' : population}) # add column
dat
print(dat.index) #access index(row)
print(dat.columns) #access attribute(columns)
dat['area'] #access attribute(columns)
#create df specify columns and index
D = pd.DataFrame(['Apple','Orange',"Pineapple"], index = [1,2,3], columns = ['Fruit'])
D
# create specify index from series
purchase_1 = pd.Series({'Name': 'Chris',
'Item Purchased': 'Dog Food',
'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Kevyn',
'Item Purchased': 'Kitty Litter',
'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Vinod',
'Item Purchased': 'Bird Seed',
'Cost': 5.00})
df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store 1', 'Store 1', 'Store 2'])
df
An Index can be thought of either as an immutable array or as an ordered set.
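A minimal sketch of both views of the Index, using illustrative values:

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])
print(ind[1], ind.size, ind.dtype)  # array-like indexing and attributes

# immutable: item assignment raises a TypeError
try:
    ind[0] = 0
except TypeError as e:
    print('immutable:', e)

# ordered-set operations between two indices
indB = pd.Index([1, 3, 5, 7, 9])
print(ind.intersection(indB))
print(ind.union(indB))
```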
data = pd.Series([0.25, 0.5, 0.75, 1.0],
index=['a', 'b', 'c', 'd'])
print(data)
data['b'] # access value with df['index']
# examine index in Series / DataFrame
print('c' in data)
print(data.keys())
print(list(data.items()))
#add value "row wise"
data['e'] = 1.25
data
# Slicing
print(data['b':'d']) # slice from index
print(data[1:4]) # slicing by implicit integer index (zero- index and (inclusive, exclusive))
# masking
data[(data > 0.3) & (data < 0.8)]
# iloc uses the implicit integer index (zero-based)
data.iloc[0]
# loc uses the explicit index labels
data.loc['a']
print(dat['area']) # access via dictionary style
dat
dat.area # access via attribute style (not preferred)
dat.area is dat['area']
dat['density'] = dat['pop'] / dat['area'] # density is population per unit area
dat
dat.T # Transpose
# index like in np.array with df.values to extract array
dat.values[0] # get first row
dat.iloc[:3,1:]
dat.loc[:'Illinois','pop':]
#The ix indexer allowed a hybrid of these two approaches (ix was removed in modern pandas; use loc/iloc instead)
# dat.ix[:3,'pop']
dat.loc[dat['area']>80,['area']] # using with mask (return only area columns)
# modify value to 5
dat.density['Florida'] = 5 # chained assignment; may trigger SettingWithCopyWarning, prefer loc
# is same as
dat.loc['Florida','density']= 5
# is same as
dat.iloc[1,2] = 5
dat
# using fancy index to specify the column and row you want
dat.loc[['Illinois','California'],['pop','density']]
For indexing conventions: indexing refers to columns, while slicing refers to rows.
#combine masking with slicing
dat[dat['pop']>15882135].loc['Florida':'New York','pop':'density']
area = pd.Series({'Alaska' : 170,
'Texas' : 180,
'California': 423967},
name='area')
population = pd.Series({'California': 38332521,
'Texas': 26448193,
'New York': 19651127},
name='population')
area, population
population + area #The resulting array contains the union of indices of the two input arrays
area.index | population.index # union of the two indices (newer pandas prefers .union())
# fill missing entries with 0 before adding
area.add(population,fill_value=0)
Mapping between Python operators and Pandas methods
Note ~ these methods accept options such as fill_value for missing entries
| Python operator | Pandas method(s) |
|---|---|
| + | add() |
| - | sub() |
| * | mul(), multiply() |
| / | truediv(),div() |
| // | floordiv() |
| % | mod() |
| ** | pow() |
# operate row - column wise in DataFrame
D = pd.DataFrame(np.arange(1,13).reshape(3,4), columns=['A', 'B','C','D'])
D
# row wise (minus everything with first row, row wise)
D - D.iloc[0]
#column wise (minus everything with 'B' column, column wise) # have to use method
D.sub(D['B'],axis=0)
We'll refer to missing data in general as null, NaN, or NA values. Two common strategies for indicating them: a separate boolean mask, or a sentinel value stored in the data itself (NumPy and Pandas use the sentinel NaN).
# np.nan has float64 type
data_with_nan = np.array([1,2,3,np.nan,4])
data_with_nan.dtype
# anything + - * / with nan = nan
data_with_nan + 5
# need to use special aggregate function like np.nanmax, np.nansum, etc
print(data_with_nan.max())
np.nanmax(data_with_nan)
Pandas marks null values with sentinels (NaN and None); tools for detecting and handling them:
$isnull()$ ~ Generate a Boolean mask indicating missing values
$notnull()$ ~ Opposite of isnull()
$dropna()$ ~ Return a filtered version of the data
$fillna()$ ~ Return a copy of the data with missing values filled or imputed
data = pd.Series([1,None,2,np.nan])
print(pd.isnull(data))
# same as
print(data.isnull()) # True for null (good for filter for null values)
# opposite to
data.notnull() # False for null (good for filter for 'not' null values)
data[data.notnull()] # filter for 'not' null values
print(data.dropna() == data[data.notnull()])
print(data.dropna() )
# for data frame , can use fillna()
df = pd.DataFrame([[1, np.nan, 2],
[2, 3, 5],
[np.nan, 4, 6]])
df
# We cannot drop single values from a DataFrame; we can only drop full rows or full columns
df.dropna() # drop if any na in row
df.dropna(axis= 'rows')
df.dropna(axis= 'columns') # unlike NumPy, in pandas axis=1 means columns and 0 means rows, so it's clearer to say 'columns' or 'rows' explicitly
# control dropping rows or columns with all NA values, or a majority of NA values.
# parameter 'thresh' or 'how'
df[3] = np.nan
print(df)
df.dropna(how='all', axis='columns')
# the thresh parameter specifies the minimum number of non-null values to keep
df.dropna(thresh=3) # keep a row only if it has 3 or more non-null values
# sample df
pushup_df = pd.DataFrame({'Name':['Samantha','Julia','Meera','Ashely','Ketty'],
'Occupation':['Doctor','Actor','Doctor','Singer','Actor']})
pushup_df
# categorize by Occupation
pivot_1 = pushup_df.pivot(columns='Occupation')
pivot_1# many NaN
# remove the NaNs and shift the remaining values up
pivot_1.apply(lambda x : pd.Series(x.dropna().values))
# fillna() method returns a 'copy' of the array with the null values replaced.
df.fillna(0)
# can also fill in place by assigning through a boolean mask from .isna()
df[df.isna()] = 0
df
# forward fill (copy value from previous row or column)
df.fillna(method='ffill') # row-wise: copy the value from the previous row
# backward fill (copy value from following row or column)
df.fillna(method='bfill',axis='columns') #column (copy from following column)
One easy way to remove these outliers all at once is to cut them via a robust sigma-clipping operation:
#Sample data
!curl -O https://raw.githubusercontent.com/jakevdp/data-CDCbirths/master/births.csv
births = pd.read_csv('births.csv')
print(births.shape)
births.head(3)
quartiles = np.percentile(births['births'], [25, 50, 75])
mu = quartiles[1]
sig = 0.74 * (quartiles[2] - quartiles[0]) # a robust estimate of the sample mean, where 0.74 comes from the interquartile range of a Gaussian distribution
# then use query to filter out rows outside these values
births_1 = births.query('(births > @mu - 5 * @sig) & (births < @mu + 5 * @sig)')
births_1.shape # rows were reduced from 15547 to 14610
Another way is to normalize to Z-scores and keep only values whose Z-score is between -3 and 3.
from scipy import stats
births['zscore'] = np.abs(stats.zscore(births.births)) # assign new column of absolute Z score
births_2 = births[(births.zscore < 3) ]
births.drop('zscore', axis='columns', inplace=True)
births_2.drop('zscore', axis='columns', inplace=True)
births_2.shape # drop from 15547 to 15067
a.k.a. multi-indexing
index = [('California', 2000), ('California', 2010),
('New York', 2000), ('New York', 2010),
('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
18976457, 19378102,
20851820, 25145561]
pop = pd.Series(populations, index=index)
pop
# use panda multiIndex to get all 2010
# create multiIndex
index = pd.MultiIndex.from_tuples(index)
index
# reindex our series with this MultiIndex, we see the hierarchical representation of the data:
pop = pop.reindex(index)
pop
#access data
pop[:,2010]
# convert a multiply indexed Series into a conventionally indexed DataFrame with unstack()
pop_df = pop.unstack()
pop_df
# convert back to multi index
pop_df.stack()
pop_df = pd.DataFrame({'total': pop,
'under18': [9267089, 9284094,
4687374, 4318033,
5906301, 6879014]})
pop_df
# fraction of people under 18 by year, given the above data
(pop_df['under18'] / pop_df['total']).unstack()
# construct
df = pd.DataFrame(np.random.randint(1,100,(4,2)),
index = [['a','a','b','b'], [1,2,1,2]] , # multi Index (index = [[first index], [ second index]])
columns= ['dat1', 'dat2'])
df
pop
#assign name
pop.index.names = ['state', 'year']
pop
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
names=['subject', 'type'])
data = np.round(np.random.randn(4, 6), 1)+20
health_data = pd.DataFrame(data, index=index, columns=columns)
print(data)
health_data
# access data with data[column or names][row or index]
health_data['Guido','HR'][2013,1]
pop
#indexing
print(pop['New York',2010])
#Partial slicing on lower levels
pop.loc[:,2000]
#masking
pop[pop>20851820]
#fancy index
print(pop[['California', 'Texas']])
print(pop[['California', 'Texas']][:,2010]) # filter on the lower level with chained []
health_data
In a DataFrame, columns are primary: indexing a single value selects a column.
# access data with data[column or names][row or index]
health_data['Sue']
health_data.iloc[:2, :2]
health_data.loc[2013,['Bob','Sue']]
# use an IndexSlice object,which Pandas provides for precisely this situation
idx = pd.IndexSlice
health_data.loc[idx[:, 1], idx[:, 'HR']]
the MultiIndex slicing operations will fail if the index is not sorted
data = pd.Series(np.random.rand(6), index=pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]]))
data.index.names = ['char','int']
data
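Before sorting, a partial slice on such an index raises an error; a minimal demo on a fresh series (so it does not depend on the random data above):

```python
import pandas as pd
import numpy as np

# an unsorted first level: 'a', 'c', 'b'
s = pd.Series(np.arange(6),
              index=pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]]))
try:
    s['a':'b']          # partial slice on an unsorted MultiIndex
except KeyError as e:   # pandas raises UnsortedIndexError, a KeyError subclass
    print('cannot slice:', type(e).__name__)
```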
# step of access data for Series(only 1 column)
data[['a','b']]['a'][1] == data['a'][1]
# slicing requires the MultiIndex to be sorted : sort_index()
data = data.sort_index()
data # a,c,b -> a,b,c sort by alphabet
# now can slice
data[:'b']
pd.Series(np.random.rand(6), index=pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]],sortorder=2))
health_data
health_data.stack()
pop
pop.unstack(level=0)
pop.unstack(level=1)
Turn the index labels into columns with the reset_index method (resets the whole MultiIndex).
flat = pop.reset_index()
flat # look like real world data
# convert flat in to multiple index
flat.set_index(['state','year'])
For a MultiIndex, aggregation methods such as mean(), sum(), and max() can be passed a level parameter that controls which subset of the data the aggregate is computed on.
health_data
mean = health_data.mean(level='visit')
mean
mean.mean(level='type',axis= 'columns')
health_data.median(level='type',axis='columns')
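Note that recent pandas versions removed the level= argument of these aggregation methods; an equivalent explicit groupby on the index level, sketched on a small made-up frame:

```python
import pandas as pd
import numpy as np

idx = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                 names=['year', 'visit'])
hd = pd.DataFrame(np.arange(8).reshape(4, 2), index=idx,
                  columns=['HR', 'Temp'])
# group on the 'visit' index level, then aggregate
print(hd.groupby(level='visit').mean())
```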
# For convenience, we’ll define this function, which creates a DataFrame of a particular form that will be useful below:
def make_df(cols, ind):
"""Quickly make a DataFrame"""
data = {c: [str(c) + str(i) for i in ind] for c in cols}
print(data)
return pd.DataFrame(data, ind)
make_df('ABC', range(3))
# concat with pd.concat()
# By default, the concatenation takes place row-wise within the DataFrame (i.e.,axis=0).
A = make_df('ABC', range(3))
B = make_df('DEF', range(4))
pd.concat([A,B], axis=1).fillna(0) #concat column wise with axis = 1
# duplicate indices
A = make_df('ABC', range(2))
B = make_df('ABC', range(3))
pd.concat([A,B]) # concat will preserve value and replicate the duplicate row index
# verify that the indices in the result of pd.concat() do not overlap, you can specify the verify_integrity flag
try :
pd.concat([A,B], verify_integrity=True)
except ValueError as e :
print("error: ",e)
# ignored index
pd.concat([A,B], ignore_index=True) # it will reindex to 0,1,2,3,4,...
# or concat with multiIndex
pd.concat([A,B], keys=['X','Y'])
# set up sample data frame
staff = pd.DataFrame([
{'Name':'Kelly','Role':'admin'},
{'Name':'John','Role':'data engineer'}
])
staff = staff.set_index('Name')
student = pd.DataFrame([
{'Name':'John','School':'Engineer'},
{'Name':'PK','School':'Business'}
])
student = student.set_index('Name')
print(staff)
print(student)
df5 = make_df('ABC', [1, 2])
df6 = make_df('BCD', [3, 4])
print(df5); print(df6); print(pd.concat([df5, df6]))
# concat with 'inner' join
pd.concat([df5,df6],join='inner')
# concat with specify column
pd.concat([df5,df6], join_axes=[df6.columns]) # join_axes was removed in pandas 1.0; use pd.concat([df5,df6]).reindex(columns=df6.columns) instead
pd.concat([df5,df6]) == df5.append(df6) # same thing (DataFrame.append was later deprecated in favor of pd.concat)
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
'hire_date': [2004, 2008, 2012, 2014]})
print(df1); print(df2)
df3 = pd.merge(df1,df2)
df3
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
'supervisor': ['Carly', 'Guido', 'Steve']})
print(df3); print(df4); print(pd.merge(df3, df4))
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting','Engineering', 'Engineering', 'HR', 'HR'],
'skills': ['math', 'spreadsheets', 'coding', 'linux','spreadsheets', 'organization']})
df5
# many values get duplicated, so consider the options in pd.merge()
pd.merge(pd.merge(df3, df4), df5)
The on keyword works only when both tables share the column name:
print(df1) ; print(df2)
pd.merge(df1,df2,on='employee' )
Specify a column on each side:
left_on ~ which column to join on from the left table
right_on ~ which column to join on from the right table
# can join column with diff name
df3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
'salary': [70000, 80000, 120000, 90000]})
print(df1); print(df3)
df1_df3 = pd.merge(df1,df3, left_on='employee', right_on='name')
df1_df3
# the result has redundant column that we can drop
df1_df3.drop('name',axis='columns') # when dropping a column, the axis has to be specified
# merge on index
df1a = df1.set_index('employee')
df2a = df2.set_index('employee')
print(df1a); print(df2a)
pd.merge(df1a,df2a, left_index=True, right_index=True)
# same thing as join() which is merge on index
pd.merge(df1a,df2a, left_index=True, right_index=True) == df1a.join(df2a)
df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
'food': ['fish', 'beans', 'bread']},
columns=['name', 'food'])
df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],
'drink': ['wine', 'beer']},
columns=['name', 'drink'])
print(df6); print(df7)
# merge default with inner join
pd.merge(df6, df7)
# get everything (Union)
pd.merge(df6,df7, how='outer')
# also 'left' and 'right' for left join and right join
pd.merge(df6,df7, how='left')
Deal with a case where your two input DataFrames have conflicting column names.
df8 = pd.DataFrame( {'name' : ['Bob','Jake', 'Lisa', 'Sue'],
'rank' : [1, 2, 3, 4]})
df9 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
'rank': [3, 1, 4, 2]})
print(df8); print(df9)
# conflicting column names get two separate columns suffixed _x and _y
pd.merge(df8, df9, on="name")
# specify the suffixes instead of _x, _y
pd.merge(df8,df9, on='name',suffixes=(' from left table',' from right table'))
#Example case of data merge
# ex.1 rank US states and territories by their 2010 population density.
pop = pd.read_csv('data/state-population.csv')
areas = pd.read_csv('data/state-areas.csv') # use '/' in paths so they work on every platform
abb = pd.read_csv('data/state-abbrevs.csv')
pop.head()
abb.head()
areas.head()
# join pop with abb with outer
df1 = pd.merge(pop,abb, left_on='state/region', right_on='abbreviation',how='outer')
df1 = df1.drop('abbreviation',axis = 1)
df1
# check NaN value for data cleaning
df1.isnull().any() # population and state have some null values
# find what is Nan
df1[df1['population'].isnull()] # state/region is Puerto Rico prior to the year 2000 (old data)
# return state/region as series that has 'state' value null and .unique for get only unique value
df1.loc[df1['state'].isnull()]['state/region'].unique()
# fill the 'state' column of 'PR' and 'USA'
df1.loc[df1['state/region']== 'PR','state'] = 'Puerto Rico'
df1.loc[df1['state/region']== 'USA','state'] = 'United States'
df1
# check again: 'state' is okay now; the remaining population NaNs can't be recovered, so data cleaning is done
df1.isnull().any()
# go on and merge with areas
df2 =pd.merge(df1,areas,on='state',how='left')
df2
# check again
df2.isnull().any() # some area is null
# get unique state that has null value of 'area (sq. mi)'
df2.loc[df2['area (sq. mi)'].isnull(),'state'].unique()
# just drop the United States rows (missing area)
df2.dropna(inplace=True)
df2
# ex.1 rank US states and territories by their 2010 population density.
# step 1 filter only year-2010 totals (could use the numexpr-backed query() for faster filtering)
df3 = df2[df2['ages']=='total']
df4 = df3[df3['year']==2010].copy() # copy so adding a column later doesn't trigger SettingWithCopyWarning
df4 # this is all data we need to answer ex.1
# find that answer
df4['pop_density'] = df4['population'] / df4['area (sq. mi)']
df4.head()
# get only column state and pop_density for answer
df5 = df4[['state','pop_density']]
# change index to state
df5.set_index('state',inplace = True)
#sort by desc
df5.sort_values('pop_density',ascending= False).head()
#import planet data set from seaborn
import seaborn as sns
planets = sns.load_dataset('planets')
planets.shape
planets.head()
df = pd.DataFrame({ 'A' : [1,2,3],
'B' : [4,5,6]})
print(df.sum())
df
# or aggregate along row with axis = 'columns'
df.sum(axis = 'columns')
# summary statistic with describe()
planets.dropna().describe()
Listing of Pandas aggregation methods
| Aggregation | Description |
|---|---|
| count() | Total number of items |
| first(), last() | First and last item |
| mean(), median() | Mean and median |
| min(), max() | Minimum and maximum |
| std(), var() | Standard deviation and variance |
| mad() | Mean absolute deviation |
| prod() | Product of all items |
| sum() | Sum of all items |
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
'data': range(6)},
columns=['key', 'data'])
df
# using groupby
df.groupby('key') # what is returned is not a set of DataFrames, but a DataFrameGroupByobject.
# it is a special view of the DataFrame, which is poised to dig into the groups but does no actual computation until the aggregation is applied
# to produce a result, we add the aggregate function
df.groupby('key').sum()
# the object can be indexed with [] just like a DataFrame
planets.groupby('method')['orbital_period'].mean()
# groups can also be iterated over
for (method,group) in planets.groupby('method'):
print("{0:30s} shape={1}".format(method, group.shape)) # this can also be done with apply()
# using group by with describe
planets.groupby('method')['orbital_period'].describe()
GroupBy objects have aggregate(), filter(), transform(), and apply()
df = pd.DataFrame({ 'key': ['A', 'B', 'C', 'A', 'B', 'C'],
'data1': range(6),
'data2': np.random.randint(0, 10, 6)},
columns = ['key', 'data1', 'data2'])
df
# aggregate()
df.groupby('key').aggregate(['first', np.argmin, 'prod']) # pass a list of aggregation functions (strings or callables)
# also can customize for each column
df.groupby('key').aggregate({'data1' : ['min','max'], 'data2' : 'max'})
# filter(): drop groups based on group properties
def filter_func(x):
    return x['data2'].std() > 2
print(df); print(df.groupby('key').std())
# keep only the groups whose data2 standard deviation exceeds 2
df.groupby('key').filter(filter_func) # returns the original rows for the groups that pass
# Transformation .transform()
# While aggregation must return a reduced version of the data, transformation can return some transformed version of the full data to recombine
df.groupby('key').transform(lambda x: x - x.mean())
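A minimal check of the transform idea on a tiny frame: subtracting each group's mean leaves every group centered at zero, and the output has the same shape as the input.

```python
import pandas as pd

df_t = pd.DataFrame({'key': ['A', 'B', 'A', 'B'],
                     'val': [1, 10, 3, 20]})
# group A has mean 2, group B has mean 15; transform keeps the original row order
demeaned = df_t.groupby('key')['val'].transform(lambda x: x - x.mean())
print(demeaned.tolist())  # [-1.0, -5.0, 1.0, 5.0]
```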
# apply() .apply()
# lets you apply an arbitrary function to the group results. The function take a DataFrame, and return either a Pandas object or a scalar;
#sample normalize column by Z-score
def normalize(x):
''' x is a DataFrame'''
x['data1'] = (x['data1'] - x['data1'].mean()) / x['data1'].std() # find Z score
x['data2'] = (x['data2'] - x['data2'].mean()) / x['data2'].std()
return x
df.groupby('key').apply(normalize)
The key can be any series or list with a length matching that of the DataFrame
L = [0, 1, 0, 1, 2, 0] # a list of group labels, one per row
df
# group by array
df.groupby(L).sum()
# rows with the same label are summed together: rows 0, 2 and 5 form group 0; rows 1 and 3 form group 1; row 4 forms group 2
print(df['key'])
df.groupby(df['key']).sum() # a Series can also serve as the grouping key
# A dictionary or series mapping index to group.
df2 = df.set_index('key')
mapping = {'A': 'vowel',
'B': 'consonant',
'C': 'consonant'}
print(df2); print(df2.groupby(mapping).sum()) #aggregate for keys
# Any Python function that maps index values to groups (here: str.lower)
df2.groupby(str.lower).mean()
# can also use multiIndex
df2.groupby([mapping,str.lower]).mean()
# Grouping example : count discovered planets by method and by decade
planets.head()
# create decade per row
decade = (planets['year']//10)*10
decade = decade.astype(str)+'s'
decade.name = 'decade'
# group multiIndex by method and decade
planets.groupby(['method',decade])['number'].sum()
planets.groupby(['method',decade])['number'].sum().unstack().fillna(0)
think of pivot tables as essentially a multidimensional version of GroupBy aggregation
# use example data from sns
import seaborn as sns
titanic = sns.load_dataset('titanic')
titanic.head()
# find survival rate by sex and class with groupby
titanic.groupby(['sex','class'])[['survived']].mean().unstack()
# with multidimension better use Pivot table
titanic.pivot_table('survived',index='sex', columns='class') # more read-able than groupby
Grouping in pivot tables can be specified at multiple levels, for example adding age as a third dimension.
# create age range per row and use as index
age = pd.cut(titanic['age'], [0,18,80])# create range from 0-18 and 18-80
titanic.pivot_table('survived', index=['sex', age], columns='class')
# range for column
fare = pd.qcut(titanic['fare'], 2) # create auto quantile 2 range
titanic.pivot_table('survived', index=['sex',age], columns=[fare,'class'])
# margin for ALL
titanic.pivot_table('survived', index='sex', columns='class', margins=True)
# sample data with Birth date
births = pd.read_csv('data/births.csv')
births.head()
decade = (births['year'] //10) *10 # create decade range by 10 years
births.pivot_table('births', index=decade, columns='gender')
%matplotlib inline
import matplotlib.pyplot as plt
sns.set() # use Seaborn styles
births.pivot_table('births', index='year', columns='gender', aggfunc='sum').plot()
plt.ylabel('total births per year');
Pandas string operations: clean up strings with the $.str$ accessor
names = pd.Series(['peter','Paul',None,'MARY','Party','gUIDO'])
names
# Capitalize all elements while skipping missing values
names.str.capitalize()
'''
All Python String method
len() lower() translate() islower()
ljust() upper() startswith() isupper()
rjust() find() endswith() isnumeric()
center() rfind() isalnum() isdecimal()
zfill() index() isalpha() split()
strip() rindex() isdigit() rsplit()
rstrip() capitalize() isspace() partition()
lstrip() swapcase() istitle() rpartition()
'''
# ex. return names start with P
names[names.str.startswith('P').fillna(False)] # 'peter' is lowercase, so it is not returned
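As a side note, the `na=` argument of `startswith` handles the missing value in one call, replacing the `fillna(False)` step:

```python
import pandas as pd

names2 = pd.Series(['peter', 'Paul', None, 'MARY'])
# na=False treats the missing entry as a non-match directly
mask = names2.str.startswith('P', na=False)
print(names2[mask].tolist())  # ['Paul']
```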
Mapping between Pandas methods and functions in Python’s re module
| Method | Description |
|---|---|
| match() | Call re.match() on each element, returning a Boolean. |
| extract() | Call re.match() on each element, returning matched groups as strings |
| findall() | Call re.findall() on each element. |
| replace() | Replace occurrences of pattern with some other string. |
| contains() | Call re.search() on each element, returning a Boolean. |
| count() | Count occurrences of pattern. |
| split() | Equivalent to str.split(), but accepts regexps. |
| rsplit() | Equivalent to str.rsplit(), but accepts regexps. |
Other Pandas string methods
| Method | Description |
|---|---|
| get() | Index each element |
| slice() | Slice each element |
| slice_replace() | Replace slice in each element with passed value |
| cat() | Concatenate strings |
| repeat() | Repeat values |
| normalize() | Return Unicode form of string |
| pad() | Add whitespace to left, right, or both sides of strings |
| wrap() | Split long strings into lines with length less than a given width |
| join() | Join strings in each element of the Series with passed separator |
| get_dummies() | Extract dummy variables as a DataFrame |
names.str[0:3] # get slice of first three character
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam','Eric Idle', 'Terry Jones', 'Michael Palin'])
# get last name
print(monte)
monte.str.split().str.get(-1) # split() gives a list of name parts; get(-1) takes the last one
full_monte = pd.DataFrame({'name': monte,
'info': ['B|C|D', 'B|D', 'A|C', 'B|D', 'B|C','B|C|D']})
full_monte
#get dummy variables
full_monte['info'].str.get_dummies('|')
# numpy datetime64 object
date = np.array('2015-07-04', dtype=np.datetime64)
date
# then can do vectorize operation
date + np.arange(12)
# date time in panda
date = pd.to_datetime('4th of july, 2016')
print(date.strftime('%D')) # full date as MM/DD/YY
print(date.strftime('%Y')) # year
print(date.strftime('%d')) # day of month
date + pd.to_timedelta(np.arange(12), 'D') #add 12 day
dates = pd.to_datetime([ '4th of July, 2015','2015-Jul-6', '07-07-2015', '20150708'])
dates - dates[0] # find time different
Regular date sequences can be created with:
$pd.date\_range()$ for timestamps,
$pd.period\_range()$ for periods,
$pd.timedelta\_range()$ for time deltas.
print(pd.date_range('2015-07-03','2015-07-10')) # return range (start,stop)
print(pd.date_range('2015-07-03',periods = 8,freq='h')) # return range (start, number of periods, hourly frequency)
pd.timedelta_range(0, periods=10, freq='ns') # nanosecond steps
pd.timedelta_range(0, periods=9, freq="2H30T") # steps of 2 hours 30 minutes
# pd.tseries.offsets
from pandas.tseries.offsets import BDay
pd.date_range('2015-07-01', periods=5, freq=BDay())
from pandas_datareader import data
goog = data.DataReader('GOOG', start='2004', end='2016',data_source='yahoo')
goog.head()
goog = goog['Close'] #interest only close price
goog.plot()
# resample() aggregates datetimes by hour, day, year, etc.
'''
All parameter
B business day frequency
C custom business day frequency (experimental)
D calendar day frequency
W weekly frequency
M month end frequency
SM semi-month end frequency (15th and end of month)
BM business month end frequency
CBM custom business month end frequency
MS month start frequency
SMS semi-month start frequency (1st and 15th)
BMS business month start frequency
CBMS custom business month start frequency
Q quarter end frequency
BQ business quarter endfrequency
QS quarter start frequency
BQS business quarter start frequency
A year end frequency
BA, BY business year end frequency
AS, YS year start frequency
BAS, BYS business year start frequency
BH business hour frequency
H hourly frequency
T, min minutely frequency
S secondly frequency
L, ms milliseconds
U, us microseconds
N nanoseconds
'''
goog.resample('W').mean() # aggregate data by week
# resample() method, or the much simpler asfreq()
# difference between the two is that resample() is fundamentally a data aggregation, while asfreq() is fundamentally a data selection.
goog.plot(alpha= 0.5, style= '-') # set style of line
goog.resample('BA').mean().plot(style=':')
goog.asfreq('BA').plot(style='--');
plt.legend(['input', 'resample', 'asfreq'],
loc='upper left')
# Notice the difference: at each point, resample reports the average of the previous year,
# while asfreq reports the value at the end of the year.
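The aggregation-versus-selection difference is easy to verify on a tiny daily series (toy data, not the stock prices above):

```python
import pandas as pd
import numpy as np

idx = pd.date_range('2020-01-01', periods=6, freq='D')
s = pd.Series(np.arange(6.0), index=idx)
# resample averages every value inside each 2-day bin...
r = s.resample('2D').mean()
# ...while asfreq just picks the value at each 2-day point
a = s.asfreq('2D')
print(r.tolist())  # [0.5, 2.5, 4.5]
print(a.tolist())  # [0.0, 2.0, 4.0]
```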
The $.eval()$ function uses string expressions to efficiently compute operations using $DataFrame$
support
Arithmetic operators + - * / ** %
Comparison operators < > <= >= == !=
Bitwise operators & | ~
Object attributes and indices. $df.loc[ ]$, $df[ ]$
nrows,ncols =100000,100
df1,df2,df3,df4 = (pd.DataFrame(np.random.rand(nrows,ncols))
for i in range(4))
df2.head()
#addition using typical sum
%timeit df1+df2+df3+df4
# using .eval() is faster
print(np.allclose(df1 + df2 + df3 + df4,pd.eval('df1 + df2 + df3 + df4'))) # check same thing?
%timeit pd.eval('df1 + df2 + df3 + df4')
#DataFrame.eval() for Column-Wise Operations
df = pd.DataFrame(np.random.rand(1000, 3), columns=['A', 'B', 'C'])
df.head()
# treat column names as variable
result1 = df.A+df['B']/df.loc[:,'C']
result2 = df.eval('A+B/C')
(result2 == result1)
# eval() also supports assignment, creating a new column (pass inplace=True to modify df itself)
df.eval('D = (A+B)').head()
# also work with variable outside (locally)
column_mean = df.mean(axis=1)
result1 = df['A'] + column_mean
result2 = df.eval('A + @column_mean') # The @ character here marks a variable name rather than a column name,
np.allclose(result1, result2)
filtering operation
df[(df.A < 0.3) & (df.B < 0.3)]
pd.eval('df[(df.A < 0.5) & (df.B < 0.5)]')
df.query('A<0.5 and B <0.5') # all three are equivalent; query() is the easiest to read
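The equivalence of the masking and query() forms can be checked directly:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
dfq = pd.DataFrame(rng.rand(100, 2), columns=['A', 'B'])
m1 = dfq[(dfq.A < 0.5) & (dfq.B < 0.5)]
m2 = dfq.query('A < 0.5 and B < 0.5')
print(m1.equals(m2))  # True
```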
multiplatform data visualization library built on NumPy arrays
One of Matplotlib’s most important features is its ability to play well with many operating
systems and graphics backends.
import matplotlib as mpl
import matplotlib.pyplot as plt
# set style
plt.style.use('classic')
%matplotlib inline
Using matplotlib in a notebook (only run once per kernel)
# sample
x = np.linspace(0,10,100)
fig = plt.figure()
plt.plot(x, np.sin(x), '-')
plt.plot(x, np.cos(x), '--')
plt.style.use('seaborn-whitegrid') # set style
# For all Matplotlib plots,
fig = plt.figure() # 1. creating a figure that contains all the objects representing axes, graphics, text, and labels.
ax = plt.axes() # 2. creating an axes contain the plot elements that make up our visualization
fig = plt.figure()
ax = plt.axes()
# use the ax.plot function to plot some data.
x = np.linspace(0,10,1000)
ax.plot(x,np.sin(x)); # use the ax.plot function to plot some data; the trailing ; suppresses text output
# or use the plt interface so that the figure and axes are created in the background
plt.plot(x, np.sin(x))
plt.plot(x, np.cos(x));
Line Colors and Styles
# color
plt.plot(x, np.sin(x - 0), color='blue') # specify color by name
plt.plot(x, np.sin(x - 1), color='g') # short color code (rgbcmyk)
plt.plot(x, np.sin(x - 2), color='0.75') # Grayscale between 0 and 1
plt.plot(x, np.sin(x - 3), color='#FFDD44') # Hex code (RRGGBB from 00 to FF)
plt.plot(x, np.sin(x - 4), color=(1.0,0.2,0.3)) # RGB tuple, values 0 and 1
plt.plot(x, np.sin(x - 5), color='chartreuse'); # all HTML color names supported
# line style
plt.plot(x, x + 4, linestyle='-') # solid
plt.plot(x, x + 5, linestyle='--') # dashed
plt.plot(x, x + 6, linestyle='-.') # dashdot
plt.plot(x, x + 7, linestyle=':'); # dotted
plt.xlim() and plt.ylim() methods
plt.plot(x, np.sin(x))
plt.xlim(-1, 11)
plt.ylim(-1.5, 5);
# The plt.axis() method can set the xand y limits with a single call, by passing a list [xmin, xmax, ymin,ymax]
plt.plot(x, np.sin(x))
plt.axis([-1,11,-1,2])
plt.plot(x, np.sin(x), label= 'sin(x)') # label for legend
plt.plot(x, np.cos(x), ':b', label='cos(x)') #label for legend
plt.title("A Sine Curve")
plt.xlabel("x") # label for x axis
plt.ylabel("sin(x)");
plt.legend(); # draws the legend from the labels set above
use character $'o'$ for scatter plot
x = np.linspace(0, 10, 30)
y = np.sin(x)
plt.plot(np.sin(np.linspace(0, 10, 30)), 'o', color='black');
#The third argument in the function call is a character that represents the type of symbol used for the plotting.
# Just as you can specify options such as '-' and '--'
# add a line to connect the markers
plt.plot(x, y, '-ok'); # line (-), circle marker (o), black (k)
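A small check that the `'-ok'` shorthand really decomposes into linestyle, marker, and color (using the non-interactive Agg backend so it runs headless):

```python
import matplotlib
matplotlib.use('Agg')  # no display needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 30)
line, = plt.plot(x, np.sin(x), '-ok')  # format string bundles style, marker, color
print(line.get_linestyle(), line.get_marker(), line.get_color())  # - o k
```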
# Demonstration of point numbers
rng = np.random.RandomState(0)
for marker in ['o', '.', ',', 'x', '+', 'v', '^', '<', '>', 's', 'd']:
plt.plot(rng.rand(5), rng.rand(5), marker,
label="marker='{0}'".format(marker))
plt.scatter is a more powerful method of creating scatter plots: the properties of each individual point (size, face color, edge color, etc.) can be individually controlled or mapped to data.
rng = np.random.RandomState(0)
x = rng.randn(100)
y = rng.randn(100)
colors = rng.rand(100)
sizes = 1000 * rng.rand(100)
plt.scatter(x, y, c=colors, s=sizes, alpha=0.3,
cmap='viridis')
plt.colorbar(); # show color scale
# test with iris data
from sklearn.datasets import load_iris
iris = load_iris()
features = iris.data.T
# #simultaneously explore four different dimensions of the data: the (x, y) location of each point corresponds to
# the sepal length and width, the size of the point is the petal width, and the color is the particular species of flower
plt.scatter(features[0],features[1],
alpha = 0.2,
s= 100* features[3],
c=iris.target,
cmap='viridis')
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1]);
plt.errorbar(x, y, yerr=dy, fmt='.k');
plt.style.use('seaborn-whitegrid')
#Visulize error (y) from point(x)
x = np.linspace(0,10,50)
dy =0.8
y= np.sin(x) +dy * np.random.randn(50)
plt.errorbar(x, y, yerr=dy, fmt='.k');
#make the errorbars lighter than the points themselves with ecolor='lightgray'
plt.errorbar(x, y, yerr=dy, fmt='o', color='black',
ecolor='lightgray', elinewidth=3, capsize=0);
data = np.random.randn(3000)
plt.hist(data);
# The hist() function has many options to tune both the calculation and the display;
plt.hist(data,bins = 30,density=True, alpha = 0.5,
histtype = 'stepfilled', color = 'steelblue',
edgecolor= 'none');
#use histtype='stepfilled' and transparency alpha to compare between distribution
x1 = np.random.normal(0, 0.8, 1000)
x2 = np.random.normal(-2, 1, 1000)
x3 = np.random.normal(3, 2, 1000)
kwargs = dict(histtype='stepfilled', alpha=0.3, density=True, bins=40)
plt.hist(x1, **kwargs)
plt.hist(x2, **kwargs)
plt.hist(x3, **kwargs);
# count the number of points in a given bin and not display it, with np.histogram()
counts, bin_edges = np.histogram(data, bins=5)
print(counts)
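How np.histogram assigns points to bins, on a tiny hand-checkable example (bins are half-open except the last, so 0.5 lands in the second bin):

```python
import numpy as np

vals = np.array([0.1, 0.4, 0.5, 0.9])
counts2, edges = np.histogram(vals, bins=2, range=(0, 1))
print(edges.tolist())    # [0.0, 0.5, 1.0]
print(counts2.tolist())  # [2, 2]
```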
create histograms in two-dimensions by dividing points among two-dimensional bins.
mean = [0, 0]
cov = [[1, 1], [1, 2]]
x, y = np.random.multivariate_normal(mean, cov, 10000).T
# plt.hist2d function:
plt.hist2d(x,y,bins= 30, cmap ='Greens')
cb = plt.colorbar()
cb.set_label('counts in bin')
counts, xedges, yedges = np.histogram2d(x, y, bins=30)
A kernel density estimate "smears out" the points in space and adds up the result to obtain a smooth density function, using scipy.stats:
from scipy.stats import gaussian_kde
# fit an array of size [Ndim, Nsamples]
data = np.vstack([x, y])
kde = gaussian_kde(data)
# evaluate on a regular grid
xgrid = np.linspace(-3.5, 3.5, 40)
ygrid = np.linspace(-6, 6, 40)
Xgrid, Ygrid = np.meshgrid(xgrid, ygrid)
Z = kde.evaluate(np.vstack([Xgrid.ravel(), Ygrid.ravel()]))
# Plot the result as an image
plt.imshow(Z.reshape(Xgrid.shape),
origin='lower', aspect='auto',
extent=[-3.5, 3.5, -6, 6],
cmap='Blues')
cb = plt.colorbar()
cb.set_label("density")
Subplots: groups of smaller axes that can exist together within a single figure.
# create plt.axes([left, bottom, width, height])
fig = plt.figure()
ax1 = fig.add_axes([0.1, 0.5, 0.8, 0.4],
xticklabels=[], ylim=(-1.2, 1.2)) # top axes
ax2 = fig.add_axes([0.1, 0.1, 0.8, 0.4],
ylim=(-1.2, 1.2)) # bottom axes
x = np.linspace(0, 10)
ax1.plot(np.sin(x))
ax2.plot(np.cos(x));
plt.subplots() is the easier tool to use (note the s at the end of subplots). Rather than creating a single subplot, this function creates a full grid of subplots in a single line, returning them in a NumPy array.
Here we'll create a 2×3 grid of subplots, where all axes in the same row share their y-axis scale, and all axes in the same column share their x-axis scale:
fig, ax = plt.subplots(2, 3, sharex='col', sharey='row')
# axes are in a two-dimensional array, indexed by [row, col]
for i in range(2):
for j in range(3):
ax[i, j].text(0.5, 0.5, str((i, j)),
fontsize=18, ha='center')
fig
plt.GridSpec() object does not create a plot by itself; it is simply a convenient interface that is recognized by the plt.subplot() command. For example, a gridspec for a grid of two rows and three columns with some specified width and height space looks like this:
grid = plt.GridSpec(2, 3, wspace=0.4, hspace=0.3)
plt.subplot(grid[0, 0])
plt.subplot(grid[0, 1:])
plt.subplot(grid[1, :2])
plt.subplot(grid[1, 2]);
# Create some normally distributed data
mean = [0, 0]
cov = [[1, 1], [1, 2]]
x, y = np.random.multivariate_normal(mean, cov, 3000).T
# Set up the axes with gridspec
fig = plt.figure(figsize=(6, 6))
grid = plt.GridSpec(4, 4, hspace=0.2, wspace=0.2)
main_ax = fig.add_subplot(grid[:-1, 1:])
y_hist = fig.add_subplot(grid[:-1, 0], xticklabels=[], sharey=main_ax)
x_hist = fig.add_subplot(grid[-1, 1:], yticklabels=[], sharex=main_ax)
# scatter points on the main axes
main_ax.plot(x, y, 'ok', markersize=3, alpha=0.2)
# histogram on the attached axes
x_hist.hist(x, 40, histtype='stepfilled',
orientation='vertical', color='gray')
x_hist.invert_yaxis()
y_hist.hist(y, 40, histtype='stepfilled',
orientation='horizontal', color='gray')
y_hist.invert_xaxis()
Seaborn provides an API on top of Matplotlib that offers sane choices for plot style and color defaults and integrates with the functionality provided by Pandas DataFrames.
# just as these two lines
import seaborn as sns
sns.set()
data = np.random.multivariate_normal([0,0], [[5,2],[2,2]], size=2000)
data = pd.DataFrame(data, columns= ['x','y'])
for col in 'xy':
plt.hist(data[col], density=True, alpha = 0.5)
# get a smooth estimate of the distribution using a kernel density estimation
for col in 'xy':
sns.kdeplot(data[col], shade=True)
# Histograms and KDE can be combined using distplot:
sns.distplot(data['x'])
sns.distplot(data['y']);
# pass two dimensional dataset
with sns.axes_style('white'):
sns.jointplot("x", "y", data, kind='kde');
This is very useful for exploring correlations between multidimensional data,
# Pair plots
iris = sns.load_dataset("iris")
iris.head()
sns.pairplot(iris, hue='species', height = 3);
tips = sns.load_dataset('tips')
tips.head()
# shows the amount that restaurant staff receive in tips based on various indicator data:
tips['tip_pct'] = 100 * tips['tip'] / tips['total_bill']
grid = sns.FacetGrid(tips, row="sex", col="time", margin_titles=True)
grid.map(plt.hist, "tip_pct", bins=np.linspace(0, 40, 15));
planets = sns.load_dataset('planets')
planets.head()
with sns.axes_style('white'):
g = sns.catplot("year", data=planets, aspect=2,
kind="count", color='steelblue')
g.set_xticklabels(step=5)
# looking at method and year and count
with sns.axes_style('white'):
g = sns.catplot("year", data=planets, aspect=4.0, kind='count',
hue='method', order=range(2001, 2015))
g.set_ylabels('Number of Planets Discovered')
using Seaborn to help visualize and understand finishing results from a marathon.
# download the data
!curl -O https://raw.githubusercontent.com/jakevdp/marathon-data/master/marathon-data.csv
data = pd.read_csv('marathon-data.csv')
data.head()
By default, Pandas loads the time columns as Python strings (type object); we fix this by reloading the file with converter functions that parse them into Timedeltas.
pd.Timedelta(hours=1, minutes = 75,seconds=1)
def convert_time(s):
h, m, s = map(int, s.split(':'))
return pd.Timedelta(hours=h, minutes = m,seconds=s)
data = pd.read_csv('marathon-data.csv',
converters={'split':convert_time, 'final':convert_time})
data.dtypes
# For the purpose of plotting, let's next add columns that give the times in seconds:
data['split_sec'] = data['split']/np.timedelta64(1,'ms') / 1E3
data['final_sec'] = data['final']/np.timedelta64(1,'ms')/ 1E3
data.head()
# the split time is the runner's time at the halfway point
with sns.axes_style('white'):
g = sns.jointplot("split_sec", "final_sec", data, kind='hex')
g.ax_joint.plot(np.linspace(4000, 16000),
np.linspace(8000, 32000), ':k')
The dotted line shows where someone's time would lie if they ran the marathon at a perfectly even pace, spending the same time on the first and second halves.
The fact that the distribution lies above this indicates (as you might expect) that most people slow down over the course of the marathon (second half use more time). If you have run competitively, you'll know that those who do the opposite—run faster during the second half of the race—are said to have "negative-split" the race.
#create another column, the split fraction, which measures
# each runner negative-splits or positive-splits the race:
data['split_frac'] = 1 - 2 * data['split_sec'] / data['final_sec']
data.head(3)
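The split-fraction formula can be checked by hand with one hypothetical runner (made-up times, not from the dataset):

```python
# hypothetical runner: first half in 7200 s, finish in 15000 s
split_sec, final_sec = 7200.0, 15000.0
split_frac = 1 - 2 * split_sec / final_sec
print(round(split_frac, 4))  # 0.04 -> positive means the second half was slower
```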
# Let's see whether there is any correlation between this split fraction and other variables. We'll do this using a pairgrid,
g = sns.PairGrid(data, vars=['age', 'split_sec', 'final_sec', 'split_frac'],
hue='gender', palette='RdBu_r')
g.map(plt.scatter, alpha=0.8)
g.add_legend();
# The difference between men and women here is interesting. Let's look at the histogram of split fractions for these two groups:
sns.kdeplot(data.split_frac[data.gender=='M'], label='men', shade=True)
sns.kdeplot(data.split_frac[data.gender=='W'], label='women', shade=True)
plt.xlabel('split_frac');
# There are many more men than women who are running close to an even split!
# Back to the men with negative splits: who are these runners? Does this split fraction correlate with finishing quickly
g = sns.lmplot('final_sec', 'split_frac', col='gender', data=data,
markers=".", scatter_kws=dict(color='c'))
g.map(plt.axhline, y=0.1, color="k", ls=":");
Machine learning involves building statistical models to help
understand data. “Learning” enters the fray when we give these models tunable
parameters that can be adapted to observed data; in this way the program can be considered
to be “learning” from the data.
Once these models have been fit to previously
seen data, they can be used to predict and understand aspects of newly observed data.
Category
# refer the rows of the matrix as samples, and the number of rows as n_samples.
# refer the columns of the matrix as features, and the number of columns as n_features.
iris = sns.load_dataset('iris')
iris.head() # species as a target array or outcomes
sns.set()
sns.pairplot(iris, hue='species', height=1.5);
# extract features and outcome from data set as x and y
X_iris = iris.drop('species',axis='columns')
y_iris = iris['species']
X_iris.shape , y_iris.shape
Basics of the API: the steps in using the Scikit-Learn estimator API are as follows.
Apply the model to new data:
• For supervised learning, often we predict labels for unknown data using the predict( ) method.
• For unsupervised learning, we often transform or infer properties of the data using the transform( ) or predict( ) method.
# Sample with Linear regression
rng = np.random.RandomState(42)
x = 10 * rng.rand(50)
y = 2 * x - 1 + rng.randn(50)
plt.scatter(x, y);
# in this case choose linear regression by import the linear regression class:
from sklearn.linear_model import LinearRegression
might need to answer one or more questions like the following:
• Would we like to fit for the offset (i.e., intercept)?
• Would we like the model to be normalized?
• Would we like to preprocess our features to add model flexibility?
• What degree of regularization would we like to use in our model?
• How many model components would we like to use?
These choices are often represented as hyperparameters, or parameters that must be set before the model is fit to data
# For linear regression, we can instantiate the LinearRegression class and
# fit the intercept using the fit_intercept hyperparameter:
model = LinearRegression(fit_intercept=True)
model
# the only action here is the storing of these hyperparameter values
Y (outcome, target array) must be in form (n_samples, )
X (features, predictors) must be in form (n_samples, n_features)
# change x form
X = x[:,np.newaxis] # add an axis: reshape to a (n_samples, 1) column vector (not a transpose)
X.shape
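What np.newaxis actually does to the shapes:

```python
import numpy as np

v = np.arange(5)
V = v[:, np.newaxis]  # adds a second axis: a column vector, not a transpose
print(v.shape, V.shape)  # (5,) (5, 1)
```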
This can be done with the fit( ) method
model.fit(X,y)
# get data from fitted model -- model.<TAB>
model.coef_
using the predict( ) method
# create sample data for unknown outcomes
xfit = np.linspace(-1, 11)
# change form
Xfit = xfit[:,np.newaxis]
# find outcome from new data
yfit = model.predict(Xfit)
# lets see the fitted line
plt.scatter(x, y)
plt.plot(xfit, yfit); # fitted line
# 1. split data for training and validate with train_test_split
from sklearn.model_selection import train_test_split
Xtrain , Xtest, ytrain, ytest = train_test_split(X_iris, y_iris , random_state =1)
print(Xtrain.shape) ;print(Xtest.shape) ;print(ytrain.shape) ;print(ytest.shape) ;
# go through step
from sklearn.naive_bayes import GaussianNB # 1. choose model class
model = GaussianNB() # 2. instantiate model
model.fit(Xtrain, ytrain) # 3. fit model to data
y_model = model.predict(Xtest) # 4. predict on new data
# evaluate model with accuracy_score
from sklearn.metrics import accuracy_score
print('accuracy score of', accuracy_score(ytest,y_model))
# classification report evaluate , print(classification_report(Y_actual,Y_predict))
from sklearn.metrics import classification_report
print(classification_report(ytest,y_model))
# confusion matrix, pd.crosstab(Y_predict, Y_actual, rownames=['Predicted'], colnames=['Actual'], margins=True)
pd.crosstab(y_model,ytest, rownames=['Predicted'], colnames=['Actual'], margins=True)
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(ytest, y_model)
sns.heatmap(mat, square=True, annot=True, cbar=False)
plt.xlabel('predicted value')
plt.ylabel('true value');
# useful summary report for regression model , regression_results(Y_actual,Y_predict)
import sklearn.metrics as metrics
def regression_results(y_true, y_pred):
# Regression metrics
explained_variance=metrics.explained_variance_score(y_true, y_pred)
mean_absolute_error=metrics.mean_absolute_error(y_true, y_pred)
mse=metrics.mean_squared_error(y_true, y_pred)
mean_squared_log_error=metrics.mean_squared_log_error(y_true, y_pred)
median_absolute_error=metrics.median_absolute_error(y_true, y_pred)
r2=metrics.r2_score(y_true, y_pred)
print('explained_variance: ', round(explained_variance,4))
print('mean_squared_log_error: ', round(mean_squared_log_error,4))
print('r2: ', round(r2,4))
print('MAE: ', round(mean_absolute_error,4))
print('MSE: ', round(mse,4))
print('RMSE: ', round(np.sqrt(mse),4))
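The MSE and RMSE lines above can be checked by hand with plain NumPy on toy numbers (not from any dataset here):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.0, 8.0])
mse = np.mean((y_true - y_pred) ** 2)  # mean of squared residuals
rmse = np.sqrt(mse)                    # back in the units of y
print(round(float(mse), 4))   # 0.4167
print(round(float(rmse), 4))
```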
#report for intercept and coefficient
params = pd.Series(model.coef_, index=X.columns)
params
# estimate effect and error from bootstrap resample
from sklearn.utils import resample
np.random.seed(1)
err = np.std([model.fit(*resample(X, y)).coef_
for i in range(1000)], 0)
# With these errors estimated, let's again look at the results:
print(pd.DataFrame({'effect': params.round(0),
'error': err.round(0)}))
Reducing the dimensionality of the Iris data so as to more easily visualize it: from 4 dimensions we ask PCA to return only 2 components.
from sklearn.decomposition import PCA # 1. Choose the model class
model = PCA(n_components=2) # 2. Instantiate the model with hyperparameters
model.fit(X_iris) # 3. Fit to data. Notice y is not specified!
X_2D = model.transform(X_iris) # 4. Transform the data to two dimensions
# let’s plot the results.
iris['PCA1'] = X_2D[:, 0]
iris['PCA2'] = X_2D[:, 1]
sns.lmplot("PCA1", "PCA2", hue='species', data=iris, fit_reg=False);
from sklearn.mixture import GaussianMixture # 1. Choose the model class
model = GaussianMixture(n_components=3,covariance_type='full') # 2. Instantiate the model w/ hyperparameters
model.fit(X_iris) # 3. Fit to data. Notice y is not specified!
y_gmm = model.predict(X_iris) # 4. Determine cluster labels
iris['cluster'] = y_gmm
sns.lmplot("PCA1", "PCA2", data=iris, hue='species',col='cluster', fit_reg=False);
# sample some data
from sklearn import datasets
X, y = datasets.load_iris(return_X_y=True)
X.shape, y.shape
using the train_test_split utility in Scikit-Learn to split data into training and validating
The holdout set is similar to unknown data, because the
model has not “seen” it before.
from sklearn.model_selection import train_test_split
# split the data with 50% in each set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
# fit the model on one set of data
model = GaussianNB()
model.fit(X_train, y_train)
# evaluate the model on the second set of data
y2_fitted = model.predict(X_test)
accuracy_score(y_test, y2_fitted)
One disadvantage of using a holdout set for model validation is that we have lost a
portion of our data to the model training.
Cross-validation does a sequence of fits where each subset of the data is used both as a training set and as a validation set.
# Using the split data from before, we could implement it like this:
y2_model = model.fit(X_train, y_train).predict(X_test)
y1_model = model.fit(X_test, y_test).predict(X_train)
accuracy_score(y_train, y1_model), accuracy_score(y_test, y2_model)
# we could average both to get a better measure of the global model performance.
# 5-fold cross validation,we can use Scikit-Learn’s cross_val_score convenience routine to do itsuccinctly:
from sklearn.model_selection import cross_val_score
cross_val_score(model, X, y, cv=5) # then take the mean score
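A NumPy-only sketch of the index splitting that k-fold cross-validation performs internally (assumed simplification: contiguous folds, no shuffling or stratification):

```python
import numpy as np

def kfold_indices(n, k):
    """Yield (train, test) index arrays for k contiguous folds over n samples."""
    folds = np.array_split(np.arange(n), k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test

splits = list(kfold_indices(10, 5))
print(len(splits))                     # 5 folds
print([len(te) for _, te in splits])   # [2, 2, 2, 2, 2]
```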
If our estimator is underperforming, how should we move forward?
• Use a more complicated/more flexible model
• Use a less complicated/less flexible model
• Gather more training samples
• Gather more data to add features to each sample
# # Here we will use a polynomial regression model: this is a generalized
# linear model in which the degree of the polynomial is a tunable parameter
# i.e., degree 1 is linear and a degree-3 polynomial fits a cubic curve to the data: y = ax^3 + bx^2 + cx + d
from sklearn.preprocessing import PolynomialFeatures #polynomial
from sklearn.linear_model import LinearRegression #linear regression
from sklearn.pipeline import make_pipeline #use a pipeline to string these operations together
def PolynomialRegression(degree=2, **kwargs):
return make_pipeline(PolynomialFeatures(degree),
LinearRegression(**kwargs))
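For a single input column, the PolynomialFeatures step of this pipeline essentially builds a Vandermonde matrix of powers; a NumPy-only sketch of what it computes for x = 2:

```python
import numpy as np

# powers 1, x, x^2, x^3, mirroring PolynomialFeatures(degree=3) on one feature
x_val = np.array([2.0])
features = np.vander(x_val, 4, increasing=True)
print(features[0].tolist())  # [1.0, 2.0, 4.0, 8.0]
```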
# create some data to which we will fit our model
def make_data(N, err=1.0, rseed=1):
# randomly sample the data
rng = np.random.RandomState(rseed)
X = rng.rand(N, 1) ** 2
y = 10 - 1. / (X.ravel() + 0.1)
if err > 0:
y += err * rng.randn(N)
return X, y
X, y = make_data(40)
X_test = np.linspace(-0.1, 1.1, 500)[:, None]
# visualize our data, along with polynomial fits of several degrees
plt.scatter(X.ravel(), y, color='black')
axis = plt.axis()
for degree in [1, 3, 5]:
    y_test = PolynomialRegression(degree).fit(X, y).predict(X_test)
    plt.plot(X_test.ravel(), y_test, label='degree={0}'.format(degree))
plt.xlim(-0.1, 1.0)
plt.ylim(-2, 12)
plt.legend(loc='best');
# A useful question to answer is this: what degree of polynomial provides a suitable
# trade-off between bias (underfitting) and variance (overfitting)?
# This can be decided by visualizing the validation curve for this particular
# data and model; we can do this straightforwardly using the validation_curve routine
from sklearn.model_selection import validation_curve
degree = np.arange(0, 21)
# Given a model, data, parameter name, and a range to explore, this function will automatically
# compute both the training score and validation score across the range
train_score, val_score = validation_curve(PolynomialRegression(), X, y, param_name='polynomialfeatures__degree', param_range=degree, cv=7)
plt.plot(degree, np.median(train_score, 1), color='blue', label='training score')
plt.plot(degree, np.median(val_score, 1), color='red', label='validation score')
plt.legend(loc='best')
plt.ylim(0, 1)
plt.xlabel('degree')
plt.ylabel('score');
# we can see that degree 3 maximizes the validation score
# we can compute degree 3 and display this fit over the original data as follows
plt.scatter(X.ravel(), y)
lim = plt.axis()
y_test = PolynomialRegression(3).fit(X, y).predict(X_test)
plt.plot(X_test.ravel(), y_test);
plt.axis(lim);
# Try with bigger data
X2, y2 = make_data(200)
plt.scatter(X2.ravel(), y2);
# try validation curve for this larger dataset;
degree = np.arange(21)
train_score2, val_score2 = validation_curve(PolynomialRegression(), X2, y2, param_name='polynomialfeatures__degree', param_range=degree, cv=7)
plt.plot(degree, np.median(train_score2, 1), color='blue',label='training score')
plt.plot(degree, np.median(val_score2, 1), color='red', label='validation score')
plt.plot(degree, np.median(train_score, 1), color='blue', alpha=0.3,linestyle='dashed')
plt.plot(degree, np.median(val_score, 1), color='red', alpha=0.3,linestyle='dashed')
plt.legend(loc='lower center')
plt.ylim(0, 1)
plt.xlabel('degree')
plt.ylabel('score');
#The solid lines show the new results, while the fainter dashed lines show the results of the previous smaller dataset.
# It is clear from the validation curve that the larger dataset can support a much more complicated model
# more data helps reduce overfitting for complex models, because the validation score keeps converging toward the training score
# but once you have enough points and a model has converged, adding more training data will not help you!
from sklearn.model_selection import learning_curve
fig, ax = plt.subplots(1, 2, figsize=(16, 6))
fig.subplots_adjust(left=0.0625, right=0.95, wspace=0.1)
for i, degree in enumerate([2, 9]):
    N, train_lc, val_lc = learning_curve(PolynomialRegression(degree), X, y, cv=7,
                                         train_sizes=np.linspace(0.3, 1, 25))
    ax[i].plot(N, np.mean(train_lc, 1), color='blue', label='training score')
    ax[i].plot(N, np.mean(val_lc, 1), color='red', label='validation score')
    ax[i].hlines(np.mean([train_lc[-1], val_lc[-1]]), N[0], N[-1], color='gray', linestyle='dashed')
    ax[i].set_ylim(0, 1)
    ax[i].set_xlim(N[0], N[-1])
    ax[i].set_xlabel('training size')
    ax[i].set_ylabel('score')
    ax[i].set_title('degree = {0}'.format(degree), size=14)
    ax[i].legend(loc='best')
#it gives us a visual depiction of how our model responds to increasing training data.
In multi-dimensional cases, such visualizations are difficult and we would rather simply find the particular model that maximizes the validation score.
from sklearn.model_selection import GridSearchCV
param_grid = {'polynomialfeatures__degree': np.arange(21),
              'linearregression__fit_intercept': [True, False]}
# note: 'linearregression__normalize' was an option in older scikit-learn, but has since been removed
grid = GridSearchCV(PolynomialRegression(), param_grid, cv=7)
grid.fit(X, y);
grid.best_params_
model = grid.best_estimator_
plt.scatter(X.ravel(), y)
lim = plt.axis()
y_test = model.fit(X, y).predict(X_test)
plt.plot(X_test.ravel(), y_test);
plt.axis(lim);
Feature engineering means taking whatever information you have about your problem and turning it into numbers that you can use to build your feature matrix.
# explore housing data where 'neighborhood' is a categorical feature
data = [{'price': 850000, 'rooms': 4, 'neighborhood': 'Queen Anne'},
{'price': 700000, 'rooms': 3, 'neighborhood': 'Fremont'},
{'price': 650000, 'rooms': 3, 'neighborhood': 'Wallingford'},
{'price': 600000, 'rooms': 2, 'neighborhood': 'Fremont'}]
# use one-hot encoding with dummy variables of 0 and 1
# with Scikit-Learn’s DictVectorizer
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False, dtype=int)
vec.fit_transform(data)
# see the meaning of each column
vec.get_feature_names_out()  # get_feature_names() in older scikit-learn
One of the simplest methods of encoding data is by word counts: you take each snippet of text, count the occurrences of each word within it, and put the results in a table.
sample = ['problem of evil',
'evil queen',
'horizon problem']
# construct a column representing each word ("problem", "evil", etc.) using Scikit-Learn's CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
X = vec.fit_transform(sample)
pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
# A better approach is term frequency-inverse document frequency (TF-IDF),
# which weights the word counts by how often they appear across the documents, down-weighting very common words.
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer()
X = vec.fit_transform(sample)
pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
We need to first replace such missing data with some appropriate fill value, using Scikit-Learn's SimpleImputer:
X = np.array([ [ np.nan, 0, 3 ],
[ 3, 7, 9 ],
[ 3, 5, 2 ],
[ 4, np.nan, 6 ],
[ 8, 8, 1 ]])
y = np.array([14, 16, -1, 8, -5])
from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='mean')
X2 = imp.fit_transform(X)
X2 #the two missing values have been replaced with the mean of the remaining values in the column.
Scikit-Learn provides a pipeline object, which can be used as follows:
from sklearn.pipeline import make_pipeline
model = make_pipeline(SimpleImputer(strategy='mean'),PolynomialFeatures(degree=2),LinearRegression())
# This pipeline looks and acts like a standard Scikit-Learn object, and will apply all the specified steps to any input data
model.fit(X, y) # X with missing values, from above
print(y)
print(model.predict(X))
Naive Bayes models are a group of extremely fast and simple classification algorithms
that are often suitable for very high-dimensional datasets.
Because they are so fast and have so few tunable parameters, they end up being very useful as a quick-and-dirty baseline for a classification problem.
The assumption is that data from each label is drawn from a simple Gaussian distribution, with independent features. We can fit this model by simply finding the mean and standard deviation of the points within each label.
from sklearn.datasets import make_blobs
X, y = make_blobs(100, 2, centers=2, random_state=2, cluster_std=1.5)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='RdBu');
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X, y);
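As the text says, "fitting" here amounts to per-class statistics. A quick self-contained check (re-creating the blobs data above) confirms that GaussianNB's `theta_` attribute holds exactly the per-class feature means:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.naive_bayes import GaussianNB

# re-create the blobs data from above
X, y = make_blobs(100, 2, centers=2, random_state=2, cluster_std=1.5)
model = GaussianNB().fit(X, y)

# compute per-class feature means by hand
means = np.array([X[y == k].mean(axis=0) for k in np.unique(y)])
print(np.allclose(means, model.theta_))  # theta_ is the per-class means
```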
# generate some new data and predict the label:
rng = np.random.RandomState(0)
Xnew = [-6, -14] + [14, 18] * rng.rand(2000, 2)
ynew = model.predict(Xnew)
# see the data boundary
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='RdBu')
lim = plt.axis()
plt.scatter(Xnew[:, 0], Xnew[:, 1], c=ynew, s=20, cmap='RdBu', alpha=0.1)
plt.axis(lim);
# Compute probability of first and second label
yprob = model.predict_proba(Xnew)
yprob[-8:].round(2)
In multinomial naive Bayes, the features are assumed to be generated from a simple multinomial distribution. The multinomial distribution describes the probability of observing counts among a number of categories, and thus multinomial naive Bayes is most appropriate for features that represent counts or count rates.
# Example : Classify text
from sklearn.datasets import fetch_20newsgroups
data = fetch_20newsgroups()
data.target_names
# get train and test set
categories = ['talk.religion.misc', 'soc.religion.christian', 'sci.space','comp.graphics']
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)
# sample data in dict form
print(train.data[5])
# we will use the TF–IDF vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train.data, train.target)
labels = model.predict(test.data)
# evaluate with confusion_matrix
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(test.target, labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,xticklabels=train.target_names, yticklabels=train.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label');
# create a function to predict which category a given phrase belongs to
def predict_category(s, train=train, model=model):
    pred = model.predict([s])
    return train.target_names[pred[0]]
# Lets try predict
print(predict_category('sending a payload to the ISS'))
print(predict_category('determining the screen resolution'))
Because naive Bayesian classifiers make such stringent assumptions about data, they will generally not perform as well as a more complicated model. That said, they have several advantages:
These advantages mean a naive Bayesian classifier is often a good choice as an initial baseline classification. If it does not perform well, then begin exploring more sophisticated models, with some baseline knowledge of how well they should perform.
Naive Bayes classifiers tend to perform especially well in one of the following situations:
Naive Bayes classifiers tend to work as well as or better than more complicated classifiers as the dimensionality grows: once you have enough data, even a simple model can be very powerful.
# sample
rng = np.random.RandomState(1)
x = 10 * rng.rand(50)
y = 2 * x - 5 + rng.randn(50)
plt.scatter(x, y);
x = x[:,np.newaxis]
# use Scikit-Learn’s LinearRegression estimator to fit this data
from sklearn.linear_model import LinearRegression
model = LinearRegression(fit_intercept=True)
# fit data
model.fit(x,y)
# create new data and find y_fitted
xfit= np.linspace(0,10,1000)[:,np.newaxis]
yfit = model.predict(xfit)
# plot
plt.scatter(x,y)
plt.plot(xfit,yfit);
# print relevant parameter
print("Model slope: ", model.coef_[0])
print("Model intercept:", model.intercept_)
One approach is to penalize large values of the model parameters. Such a penalty is known as regularization, and comes in several forms:
Ridge regression (L2 regularization)
Lasso regularization (L1)
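The Ridge/Lasso examples below rely on `GaussianFeatures` and `basis_plot` helpers that are not defined in these notes. A minimal sketch of a `GaussianFeatures` transformer, assuming uniformly spaced Gaussian basis functions over 1-D input (my reconstruction, not the exact original), might look like:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class GaussianFeatures(BaseEstimator, TransformerMixin):
    """Uniformly spaced Gaussian basis functions for 1-D input."""
    def __init__(self, N, width_factor=2.0):
        self.N = N
        self.width_factor = width_factor

    @staticmethod
    def _gauss_basis(x, y, width, axis=None):
        arg = (x - y) / width
        return np.exp(-0.5 * np.sum(arg ** 2, axis))

    def fit(self, X, y=None):
        # place basis centers evenly over the data range
        self.centers_ = np.linspace(X.min(), X.max(), self.N)
        self.width_ = self.width_factor * (self.centers_[1] - self.centers_[0])
        return self

    def transform(self, X):
        # one Gaussian feature per center: output shape (n_samples, N)
        return self._gauss_basis(X[:, :, np.newaxis], self.centers_,
                                 self.width_, axis=1)

# quick demo on toy 1-D data (shape check only)
X_demo = np.linspace(0, 1, 20)[:, None]
print(GaussianFeatures(30).fit(X_demo).transform(X_demo).shape)  # (20, 30)
```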
# Ridge L2
from sklearn.linear_model import Ridge
model = make_pipeline(GaussianFeatures(30), Ridge(alpha=0.1))
basis_plot(model, title='Ridge Regression')
# L1
from sklearn.linear_model import Lasso
model = make_pipeline(GaussianFeatures(30), Lasso(alpha=0.001))
basis_plot(model, title='Lasso Regression')
predict the number of bicycle trips across Seattle’s Fremont Bridge based on weather, season, and other factors.
# start by load data
!curl -o FremontBridge.csv https://data.seattle.gov/api/views/65db-xm6k/rows.csv?accessType=DOWNLOAD
counts = pd.read_csv('FremontBridge.csv', index_col='Date', parse_dates=True)
weather = pd.read_csv('data/BicycleWeather.csv', index_col='DATE', parse_dates=True)
# compute the total daily bicycle traffic
daily = counts.resample('d').sum()
daily['Total'] = daily.sum(axis=1)
daily = daily[['Total']] # remove other columns
daily.head(3)
# We saw that the patterns of use generally vary from day to day
sns.set_style("darkgrid")
sns.histplot(daily['Total'], kde=True)  # sns.distplot is deprecated in recent seaborn
# let's account for this in our data by adding binary columns that indicate the day of the week:
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
for i in range(7):
    daily[days[i]] = (daily.index.dayofweek == i).astype(float)
daily.head(3)
# Similarly, we might expect riders to behave differently on holidays; let's add an indicator of this as well:
from pandas.tseries.holiday import USFederalHolidayCalendar
cal = USFederalHolidayCalendar()
holidays = cal.holidays('2012', '2016')
daily = daily.join(pd.Series(1, index=holidays, name='holiday'))
daily['holiday'].fillna(0, inplace=True)
daily.head(3)
#We also might suspect that the hours of daylight would affect how many people ride;
# let's use the standard astronomical calculation to add this information:
def hours_of_daylight(date, axis=23.44, latitude=47.61):
    """Compute the hours of daylight for the given date"""
    days = (date - pd.Timestamp('2000-12-21')).days  # pd.datetime has been removed from pandas
    m = (1. - np.tan(np.radians(latitude))
         * np.tan(np.radians(axis) * np.cos(days * 2 * np.pi / 365.25)))
    return 24. * np.degrees(np.arccos(1 - np.clip(m, 0, 2))) / 180.
daily['daylight_hrs'] = list(map(hours_of_daylight, daily.index))
daily[['daylight_hrs']].plot()
plt.ylim(8, 17)
#We can also add the average temperature and total precipitation to the data.
# In addition to the inches of precipitation, let's add a flag that indicates whether a day is dry (has zero precipitation):
weather['Temp (C)'] = 0.5 * (weather['TMIN'] + weather['TMAX']) / 10  # TMIN/TMAX are in tenths of a degree C; average and convert
# precip is in 1/10 mm; convert to inches
weather['PRCP'] /= 254 # precip
weather['dry day'] = (weather['PRCP'] == 0).astype(int)
daily = daily.join(weather[['PRCP', 'Temp (C)', 'dry day']])
daily.head(3)
# Let's add a counter that increases from day 1, and measures how many years have passed.
# This will let us measure any observed annual increase or decrease in daily crossings:
daily['annual'] = (daily.index - daily.index[0]).days / 365.
daily.head()
# Drop any rows with null values
# choose the columns to use, and fit a linear regression model to our data. We will set fit_intercept = False,
# because the daily flags essentially operate as their own day-specific intercepts:
daily.dropna(axis=0, how='any', inplace=True)
column_names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun', 'holiday','daylight_hrs', 'PRCP', 'dry day', 'Temp (C)', 'annual']
X = daily[column_names]
y = daily['Total']
# fit model
from sklearn.linear_model import LinearRegression
model = LinearRegression(fit_intercept=False)
model.fit(X, y)
daily['predicted'] = model.predict(X)
# Finally, we can compare the total and predicted bicycle traffic visually:
daily[['Total', 'predicted']].plot(alpha=0.5);
It is evident that we have missed some key features, especially during the summer time.
Either our features are not complete (i.e., people decide whether to ride to work based on more than just these) or there are some nonlinear relationships that we have failed to take into account (e.g., perhaps people ride less at both high and low temperatures).
Nevertheless, our rough approximation is enough to give us some insights, and we can take a look at the coefficients of the linear model to estimate how much each feature contributes to the daily bicycle count:
params = pd.Series(model.coef_, index=X.columns)
params
# These numbers are difficult to interpret without some measure of their uncertainty.
# We can compute these uncertainties quickly using bootstrap resamplings of the data:
from sklearn.utils import resample
np.random.seed(1)
err = np.std([model.fit(*resample(X, y)).coef_
              for i in range(1000)], 0)
# With these errors estimated, let's again look at the results:
print(pd.DataFrame({'effect': params.round(0),
'error': err.round(0)}))
We first see that there is a relatively stable trend in the weekly baseline: there are many more riders on weekdays than on weekends and holidays.
We see that for each additional hour of daylight, 270 ± 19 more people choose to ride; a temperature increase of one degree Celsius encourages 126 ± 8 people to grab their bicycle; a dry day means an average of 1461 ± 62 more riders, and each inch of precipitation means 0 ± 0 more people leave their bike at home.
Once all these effects are accounted for, we see a modest increase of 52 ± 37 new daily riders each year.
Our model is almost certainly missing some relevant information. For example, nonlinear effects (such as effects of precipitation and cold temperature) and nonlinear trends within each variable (such as disinclination to ride at very cold and very hot temperatures) cannot be accounted for in this model.
Additionally, we have thrown away some of the finer-grained information (such as the difference between a rainy morning and a rainy afternoon), and we have ignored correlations between days (such as the possible effect of a rainy Tuesday on Wednesday's numbers, or the effect of an unexpected sunny day after a streak of rainy days). These are all potentially interesting effects, and you now have the tools to begin exploring them if you wish!
Bayesian classification described the distribution of each underlying class, and used these generative models to probabilistically determine labels for new points; that was an example of generative classification.
Here we will consider instead discriminative classification: rather than modeling each class, we simply find a line or curve (in two dimensions) or manifold (in multiple dimensions) that divides the classes from each other.
# sample with well distance 2 class of dataset
from scipy import stats
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=50, centers=2,
random_state=0, cluster_std=0.60)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn');
A linear discriminative classifier would attempt to draw a straight line separating the two sets of data, and thereby create a model for classification.
In support vector machines, we can draw around each candidate line a margin of some width, extending up to the nearest point.
xfit = np.linspace(-1, 3.5)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
for m, b, d in [(1, 0.65, 0.33), (0.5, 1.6, 0.55), (-0.2, 2.9, 0.2)]:
    yfit = m * xfit + b
    plt.plot(xfit, yfit, '-k')
    plt.fill_between(xfit, yfit - d, yfit + d, edgecolor='none', color='#AAAAAA', alpha=0.4)
plt.xlim(-1, 3.5);
In support vector machines, the line that maximizes this margin is the one we will choose as the optimal model. Support vector machines are an example of such a maximum margin estimator.
Use Scikit-Learn's support vector classifier to train an SVM model on this data. For the time being, we will use a linear kernel and set the C parameter to a very large number
from sklearn.svm import SVC # "Support vector classifier"
model = SVC(kernel='linear', C=1E10)
model.fit(X, y)
To better visualize what's happening here, let's create a quick convenience function that will plot SVM decision boundaries for us:
def plot_svc_decision_function(model, ax=None, plot_support=True):
    """Plot the decision function for a 2D SVC"""
    if ax is None:
        ax = plt.gca()
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()
    # create grid to evaluate model
    x = np.linspace(xlim[0], xlim[1], 30)
    y = np.linspace(ylim[0], ylim[1], 30)
    Y, X = np.meshgrid(y, x)
    xy = np.vstack([X.ravel(), Y.ravel()]).T
    P = model.decision_function(xy).reshape(X.shape)
    # plot decision boundary and margins
    ax.contour(X, Y, P, colors='k',
               levels=[-1, 0, 1], alpha=0.5,
               linestyles=['--', '-', '--'])
    # plot support vectors
    if plot_support:
        ax.scatter(model.support_vectors_[:, 0],
                   model.support_vectors_[:, 1],
                   s=300, linewidth=1, facecolors='none')
    ax.set_xlim(xlim)
    ax.set_ylim(ylim)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
plot_svc_decision_function(model);
This is the dividing line that maximizes the margin between the two sets of points. Notice that a few of the training points just touch the margin: they are indicated by the black circles in this figure. These points are the pivotal elements of this fit, and are known as the support vectors, giving the algorithm its name. In Scikit-Learn, the identity of these points is stored in the support_vectors_ attribute of the classifier:
model.support_vectors_
Where SVM becomes extremely powerful is when it is combined with kernels, much like using polynomial basis functions with linear regression.
# let's look at some data that is not linearly separable:
from sklearn.datasets import make_circles
X, y = make_circles(100, factor=.1, noise=.1)
clf = SVC(kernel='linear').fit(X, y)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
plot_svc_decision_function(clf, plot_support=False);
Project the data into a higher dimension to compute a radial basis function centered on the middle clump:
r = np.exp(-(X ** 2).sum(1))
%matplotlib inline
from mpl_toolkits import mplot3d
from ipywidgets import interact, fixed  # needed for the interactive view below

def plot_3D(elev=30, azim=30, X=X, y=y):
    ax = plt.subplot(projection='3d')
    ax.scatter3D(X[:, 0], X[:, 1], r, c=y, s=50, cmap='autumn')
    ax.view_init(elev=elev, azim=azim)
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.set_zlabel('r')

interact(plot_3D, elev=[25, 10], azim=(-180, 180), X=fixed(X), y=fixed(y));
We can see that with this additional dimension, the data becomes trivially linearly separable, by drawing a separating plane at, say, r=0.7.
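We can confirm this numerically: with r added as a third feature, a plain linear SVC separates the circles almost perfectly (a sketch; the random_state is my addition for reproducibility):

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# re-create the circles data (random_state added for reproducibility)
X, y = make_circles(100, factor=.1, noise=.1, random_state=0)
r = np.exp(-(X ** 2).sum(1))       # the radial basis feature from above
X3 = np.hstack([X, r[:, None]])    # lift the data into three dimensions
clf = SVC(kernel='linear').fit(X3, y)
print(clf.score(X3, y))            # near-perfect separation in the lifted space
```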
we would like to somehow automatically find the best basis functions to use.
One strategy to this end is to compute a basis function centered at every point in the dataset, and let the SVM algorithm sift through the results. This type of basis function transformation is known as a kernel transformation, as it is based on a similarity relationship (or kernel) between each pair of points.
clf = SVC(kernel='rbf', C=1E6)
clf.fit(X, y)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
plot_svc_decision_function(clf)
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
s=300, lw=1, facecolors='none');
# where data is blended
X, y = make_blobs(n_samples=100, centers=2,random_state=0, cluster_std=1.2)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn');
To handle this case, the SVM implementation has a bit of a fudge-factor which "softens" the margin: that is, it allows some of the points to creep into the margin if that allows a better fit.
For very large $C$, the margin is hard, and points cannot lie in it. For smaller $C$, the margin is softer, and can grow to encompass some points.
# Visual picture of how a changing $C$ parameter affects the final fit
X, y = make_blobs(n_samples=100, centers=2,random_state=0, cluster_std=0.8)
fig, ax = plt.subplots(1, 2, figsize=(16, 6))
fig.subplots_adjust(left=0.0625, right=0.95, wspace=0.1)
for axi, C in zip(ax, [10.0, 0.1]):
    model = SVC(kernel='linear', C=C).fit(X, y)
    axi.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
    plot_svc_decision_function(model, axi)
    axi.scatter(model.support_vectors_[:, 0],
                model.support_vectors_[:, 1],
                s=300, lw=1, facecolors='none')
    axi.set_title('C = {0:.1f}'.format(C), size=14)
take a look at the facial recognition problem
# import data
from sklearn.datasets import fetch_lfw_people
faces = fetch_lfw_people(min_faces_per_person=60)
print(faces.target_names)
print(faces.images.shape)
fig, ax = plt.subplots(3, 5)
for i, axi in enumerate(ax.flat):
    axi.imshow(faces.images[i], cmap='bone')
    axi.set(xticks=[], yticks=[],
            xlabel=faces.target_names[faces.target[i]])
# use a PCA to extract 150 fundamental components to feed into our support vector machine classifier.
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
pca = PCA(n_components=150, whiten=True, random_state=42)
svc = SVC(kernel='rbf', class_weight='balanced')
model = make_pipeline(pca, svc)
# split data to train and test
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(faces.data, faces.target,random_state=42)
# use a grid search cross-validation to explore combinations of parameters
from sklearn.model_selection import GridSearchCV
param_grid = {'svc__C': [1, 5, 10, 50],
'svc__gamma': [0.0001, 0.0005, 0.001, 0.005]}
grid = GridSearchCV(model, param_grid)
%time grid.fit(Xtrain, ytrain)
print(grid.best_params_)
model = grid.best_estimator_
yfit = model.predict(Xtest)
# see predicted image
fig, ax = plt.subplots(4, 6)
for i, axi in enumerate(ax.flat):
    axi.imshow(Xtest[i].reshape(62, 47), cmap='bone')
    axi.set(xticks=[], yticks=[])
    axi.set_ylabel(faces.target_names[yfit[i]].split()[-1],
                   color='black' if yfit[i] == ytest[i] else 'red')
fig.suptitle('Predicted Names; Incorrect Labels in Red', size=14);
# see classification report
from sklearn.metrics import classification_report
print(classification_report(ytest, yfit,
target_names=faces.target_names))
#This helps us get a sense of which labels are likely to be confused by the estimator.
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(ytest, yfit)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
xticklabels=faces.target_names,
yticklabels=faces.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label');
We have seen here a brief intuitive introduction to the principles behind support vector machines. These methods are powerful classifiers for a number of reasons:
However, SVMs have several disadvantages as well
The binary splitting makes this extremely efficient: in a well-constructed tree, each question will cut the number of options by approximately half
each node in the tree splits the data into two groups using a cutoff value within one of the features
# sample data with four class
from sklearn.datasets import make_blobs
plt.style.use('dark_background')
X, y = make_blobs(n_samples=300, centers=4,random_state=0, cluster_std=1.0)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='rainbow');
# Scikit-Learn with the DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier().fit(X, y)
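To see the cutoff questions a fitted tree actually asks, scikit-learn's `export_text` can print them; a small sketch using a shallow tree for readability (the `max_depth=2` limit is my choice, not from the original):

```python
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier, export_text

# same four-class blobs as above
X, y = make_blobs(n_samples=300, centers=4, random_state=0, cluster_std=1.0)
shallow = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(shallow, feature_names=['x0', 'x1']))  # one line per cutoff question
```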
# help visualize
def visualize_classifier(model, X, y, ax=None, cmap='rainbow'):
    ax = ax or plt.gca()
    # Plot the training points
    ax.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap=cmap,
               clim=(y.min(), y.max()), zorder=3)
    ax.axis('tight')
    ax.axis('off')
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()
    # fit the estimator
    model.fit(X, y)
    xx, yy = np.meshgrid(np.linspace(*xlim, num=200),
                         np.linspace(*ylim, num=200))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    # Create a color plot with the results
    n_classes = len(np.unique(y))
    contours = ax.contourf(xx, yy, Z, alpha=0.3,
                           levels=np.arange(n_classes + 1) - 0.5,
                           cmap=cmap, clim=(y.min(), y.max()), zorder=1)
    ax.set(xlim=xlim, ylim=ylim)
visualize_classifier(DecisionTreeClassifier(), X, y)
it is very easy to go too deep in the tree, and thus to fit details of the particular data rather than the overall properties of the distributions
Individual tree models are high-variance and inconsistent with one another, so combining information from many trees gives a better result!
Multiple overfitting estimators can be combined to reduce the effect of this overfitting; this approach is called bagging.
# bagging classification can be done manually using Scikit-Learn's BaggingClassifier meta-estimator
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
tree = DecisionTreeClassifier()
bag = BaggingClassifier(tree, n_estimators=100, max_samples=0.8,random_state=1) # 100 trees
bag.fit(X, y)
visualize_classifier(bag, X, y)
In Scikit-Learn, such an optimized ensemble of randomized decision trees is implemented in the RandomForestClassifier estimator,
which takes care of all the randomization automatically. All you need to do is select a number of estimators
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=10000, random_state=0)
visualize_classifier(model, X, y);
# Consider the following data, drawn from the combination of a fast and slow oscillation:
rng = np.random.RandomState(42)
x = 10 * rng.rand(200)
def model(x, sigma=0.3):
    fast_oscillation = np.sin(5 * x)
    slow_oscillation = np.sin(0.5 * x)
    noise = sigma * rng.randn(len(x))
    return slow_oscillation + fast_oscillation + noise
y = model(x)
plt.errorbar(x, y, 0.3, fmt='o');
# Using the random forest regressor, we can find the best fit curve as follows:
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(200)
forest.fit(x[:, None], y)
xfit = np.linspace(0, 10, 1000)
yfit = forest.predict(xfit[:, None])
ytrue = model(xfit, sigma=0)
plt.errorbar(x, y, 0.3, fmt='o', alpha=0.5)
plt.plot(xfit, yfit, '-r');
plt.plot(xfit, ytrue, '-k', alpha=0.5);
from sklearn.datasets import load_digits
digits = load_digits()
digits.keys()
# Visualize data
# set up the figure
fig = plt.figure(figsize=(6, 6)) # figure size in inches
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)
# plot the digits: each image is 8x8 pixels
for i in range(64):
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
    ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')
    # label the image with the target value
    ax.text(0, 7, str(digits.target[i]))
# We can quickly classify the digits using a random forest as follows:
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(digits.data, digits.target, random_state=0)
model = RandomForestClassifier(n_estimators=1000)
model.fit(Xtrain,ytrain)
ypred = model.predict(Xtest)
from sklearn import metrics
print(metrics.classification_report(ytest, ypred))  # true labels first, then predictions
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(ytest, ypred)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False)
plt.xlabel('true label')
plt.ylabel('predicted label');
# We find that a simple, untuned random forest results in a very accurate classification of the digits data.
Random forests are a powerful method with several advantages:
PCA is fundamentally a dimensionality reduction algorithm, but it is also a useful tool for visualization, noise filtering, and feature extraction and engineering.
plt.style.use('classic')
#sample some data
rng = np.random.RandomState(1)
X = np.dot(rng.rand(2,2) , rng.randn(2,200)).T
plt.scatter(X[:, 0], X[:, 1])
plt.axis('equal');
Unsupervised learning attempts to learn about the relationship between the x and y values.
In PCA, this relationship is quantified by finding a list of the principal axes in the data, and using those axes to describe the dataset.
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(X)
print(pca.components_)
print(pca.explained_variance_)
# visualize using the components as direction vectors and the explained variance to define the squared length of each vector
def draw_vector(v0, v1, ax=None):
    ax = ax or plt.gca()
    arrowprops = dict(arrowstyle='->',
                      linewidth=2,
                      shrinkA=0, shrinkB=0)
    ax.annotate('', v1, v0, arrowprops=arrowprops)
# plot data
plt.scatter(X[:, 0], X[:, 1], alpha=0.2)
for length, vector in zip(pca.explained_variance_, pca.components_):
    v = vector * 3 * np.sqrt(length)
    draw_vector(pca.mean_, pca.mean_ + v)
plt.axis('equal');
These vectors represent the principal axes of the data, and the length of the vector is an indication of how "important" that axis is in describing the distribution of the data
Using PCA for dimensionality reduction involves zeroing out one or more of the smallest principal components, resulting in a lower-dimensional projection of the data that preserves the maximal data variance.
pca = PCA(n_components=1)
pca.fit(X)
X_pca = pca.transform(X)
print("original shape: ", X.shape)
print("transformed shape:", X_pca.shape)
The transformed data has been reduced to a single dimension. To understand the effect of this dimensionality reduction, we can perform the inverse transform of this reduced data and plot it along with the original data:
X_new = pca.inverse_transform(X_pca)
plt.scatter(X[:, 0], X[:, 1], alpha=0.2)
plt.scatter(X_new[:, 0], X_new[:, 1], alpha=0.8)
plt.axis('equal');
The light points are the original data, while the dark points are the projected version. This makes clear what a PCA dimensionality reduction means: the information along the least important principal axis or axes is removed, leaving only the component(s) of the data with the highest variance
The usefulness of the dimensionality reduction becomes much more clear when looking at high-dimensional data
from sklearn.datasets import load_digits
digits = load_digits()
digits.data.shape
pca = PCA(2) # project from 64 to 2 dimensions
projected = pca.fit_transform(digits.data)
print(digits.data.shape)
print(projected.shape)
# We can now plot the first two principal components of each point to learn about the data:
plt.scatter(projected[:, 0], projected[:, 1],
c=digits.target, edgecolor='none', alpha=0.5,
cmap=plt.cm.get_cmap('Spectral', 10))
plt.xlabel('component 1')
plt.ylabel('component 2')
plt.colorbar();
PCA can be thought of as a process of choosing optimal basis functions, such that adding together just the first few of them is enough to suitably reconstruct the bulk of the elements in the dataset.
This can be determined by looking at the cumulative explained variance ratio as a function of the number of components:
pca = PCA().fit(digits.data)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');
For the digits, the first 10 components contain approximately 75% of the variance, while around 50 components are needed to describe close to 100% of the variance.
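Conveniently, rather than fixing the number of components, PCA also accepts a float between 0 and 1 as n_components and keeps just enough components to explain that fraction of the variance. A quick sketch (the 90% threshold is an arbitrary choice):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()
# keep just enough components to explain 90% of the variance
pca = PCA(n_components=0.90)
reduced = pca.fit_transform(digits.data)
print(pca.n_components_, reduced.shape)
```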
The idea is this: any components with variance much larger than the effect of the noise should be relatively unaffected by the noise.
So if you reconstruct the data using just the largest subset of principal components, you should be preferentially keeping the signal and throwing out the noise.
## Plot noise-free data
def plot_digits(data):
fig, axes = plt.subplots(4, 10, figsize=(10, 4),
subplot_kw={'xticks':[], 'yticks':[]},
gridspec_kw=dict(hspace=0.1, wspace=0.1))
for i, ax in enumerate(axes.flat):
ax.imshow(data[i].reshape(8, 8),
cmap='binary', interpolation='nearest',
clim=(0, 16))
plot_digits(digits.data)
# Let's add some random noise to create a noisy dataset, and re-plot it:
np.random.seed(42)
noisy = np.random.normal(digits.data, 4)
plot_digits(noisy)
pca = PCA(0.50).fit(noisy)
pca.n_components_
#Here 50% of the variance amounts to 12 principal components.
#compute these components, and then use the inverse of the transform to reconstruct the filtered digits:
components = pca.transform(noisy)
filtered = pca.inverse_transform(components)
plot_digits(filtered)
This signal preserving/noise filtering property makes PCA a very useful feature selection routine—for example, rather than training a classifier on very high-dimensional data, you might instead train the classifier on the lower-dimensional representation, which will automatically serve to filter out random noise in the inputs.
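A sketch of that workflow as a pipeline; the choice of 20 components and of logistic regression as the classifier are illustrative, not prescriptive:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

digits = load_digits()
Xtrain, Xtest, ytrain, ytest = train_test_split(
    digits.data, digits.target, random_state=0)

# project to 20 dimensions, then classify on the reduced representation
model = make_pipeline(PCA(n_components=20),
                      LogisticRegression(max_iter=2000))
model.fit(Xtrain, ytrain)
score = model.score(Xtest, ytest)
print(score)
```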
As an example with faces, let's take a look at the principal axes that span the Labeled Faces in the Wild dataset.
from sklearn.datasets import fetch_lfw_people
faces = fetch_lfw_people(min_faces_per_person=60)
print(faces.target_names)
print(faces.images.shape)
pca = PCA(150)
pca.fit(faces.data)
# plot the first few components
fig, axes = plt.subplots(3, 8, figsize=(9, 4),
subplot_kw={'xticks':[], 'yticks':[]},
gridspec_kw=dict(hspace=0.1, wspace=0.1))
for i, ax in enumerate(axes.flat):
ax.imshow(pca.components_[i].reshape(62, 47), cmap='bone')
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');
# Compute the components and projected faces
pca = PCA(150).fit(faces.data)
components = pca.transform(faces.data)
projected = pca.inverse_transform(components)
# Compare the originals with the 150-component reconstructions (from ~3,000 initial features)
fig, ax = plt.subplots(2, 10, figsize=(10, 2.5),
subplot_kw={'xticks':[], 'yticks':[]},
gridspec_kw=dict(hspace=0.1, wspace=0.1))
for i in range(10):
ax[0, i].imshow(faces.data[i].reshape(62, 47), cmap='binary_r')
ax[1, i].imshow(projected[i].reshape(62, 47), cmap='binary_r')
ax[0, 0].set_ylabel('full-dim\ninput')
ax[1, 0].set_ylabel('150-dim\nreconstruction');
In this section we have discussed the use of principal component analysis for dimensionality reduction, for visualization of high-dimensional data, for noise filtering, and for feature selection within high-dimensional data. Because of the versatility and interpretability of PCA, it has been shown to be effective in a wide variety of contexts and disciplines. Given any high-dimensional dataset, I tend to start with PCA in order to visualize the relationship between points (as we did with the digits), to understand the main variance in the data (as we did with the eigenfaces), and to understand the intrinsic dimensionality (by plotting the explained variance ratio). Certainly PCA is not useful for every high-dimensional dataset, but it offers a straightforward and efficient path to gaining insight into high-dimensional data.
PCA's main weakness is that it tends to be highly affected by outliers in the data. For this reason, many robust variants of PCA have been developed, many of which act to iteratively discard data points that are poorly described by the initial components. Scikit-Learn contains a couple of interesting variants on PCA, including RandomizedPCA and SparsePCA, both also in the sklearn.decomposition submodule. RandomizedPCA, which we saw earlier, uses a non-deterministic method to quickly approximate the first few principal components in very high-dimensional data, while SparsePCA introduces a regularization term (see In Depth: Linear Regression) that serves to enforce sparsity of the components.
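In recent versions of scikit-learn, the standalone RandomizedPCA class has been folded into PCA as a solver option. A minimal sketch on synthetic high-dimensional data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X_high = rng.randn(500, 1000)  # synthetic high-dimensional data

# in current scikit-learn, RandomizedPCA survives as a solver option of PCA
pca = PCA(n_components=10, svd_solver='randomized', random_state=0)
X_low = pca.fit_transform(X_high)
print(X_low.shape)
```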
In the following sections, we will look at other unsupervised learning methods that build on some of the ideas of PCA.
PCA can be used in the dimensionality reduction task—reducing the number of features of a dataset while maintaining the essential relationships between the points.
While PCA is flexible, fast, and easily interpretable, it does not perform so well when there are nonlinear relationships within the data
Manifold learning seeks to describe datasets as low-dimensional manifolds embedded in high-dimensional spaces; in the classic analogy of a sheet of paper curled up in three dimensions, it tries to learn the fundamental two-dimensional nature of the paper.
# sample data shape of HELLO
def make_hello(N=1000, rseed=42):
# Make a plot with "HELLO" text; save as PNG
fig, ax = plt.subplots(figsize=(4, 1))
fig.subplots_adjust(left=0, right=1, bottom=0, top=1)
ax.axis('off')
ax.text(0.5, 0.4, 'HELLO', va='center', ha='center', weight='bold', size=85)
fig.savefig('hello.png')
plt.close(fig)
# Open this PNG and draw random points from it
from matplotlib.image import imread
data = imread('hello.png')[::-1, :, 0].T
rng = np.random.RandomState(rseed)
X = rng.rand(4 * N, 2)
i, j = (X * data.shape).astype(int).T
mask = (data[i, j] < 1)
X = X[mask]
X[:, 0] *= (data.shape[0] / data.shape[1])
X = X[:N]
return X[np.argsort(X[:, 0])]
X = make_hello(1000)
colorize = dict(c=X[:, 0], cmap=plt.cm.get_cmap('rainbow', 5))
plt.scatter(X[:, 0], X[:, 1], **colorize)
plt.axis('equal');
# we can scale, shrink, or rotate the data, and the "HELLO" will still be apparent.
def rotate(X, angle):
theta = np.deg2rad(angle)
R = [[np.cos(theta), np.sin(theta)],
[-np.sin(theta), np.cos(theta)]]
return np.dot(X, R)
X2 = rotate(X, 20) + 5
plt.scatter(X2[:, 0], X2[:, 1], **colorize)
plt.axis('equal');
This tells us that the x and y values are not necessarily fundamental to the relationships in the data. What is fundamental, in this case, is the distance between each point and the other points in the dataset
# Let's use Scikit-Learn's efficient pairwise_distances function to do this for our original data:
from sklearn.metrics import pairwise_distances
D = pairwise_distances(X)
print(D.shape)
plt.imshow(D, zorder=2, cmap='Blues', interpolation='nearest')
plt.colorbar();
The usefulness of this becomes more apparent when we consider the fact that distance matrices can be computed from data in any dimension.
def random_projection(X, dimension=3, rseed=42):
assert dimension >= X.shape[1]
rng = np.random.RandomState(rseed)
C = rng.randn(dimension, dimension)
e, V = np.linalg.eigh(np.dot(C, C.T))
return np.dot(X, V[:X.shape[1]])
X3 = random_projection(X, 3)
X3.shape
from mpl_toolkits import mplot3d
ax = plt.axes(projection='3d')
ax.scatter3D(X3[:, 0], X3[:, 1], X3[:, 2],
**colorize)
ax.view_init(azim=70, elev=50)
# ask the MDS estimator to take this three-dimensional data, compute the distance matrix, and then determine the optimal two-dimensional embedding for this distance matrix.
from sklearn.manifold import MDS
model = MDS(n_components=2, random_state=1)
out3 = model.fit_transform(X3)
plt.scatter(out3[:, 0], out3[:, 1], **colorize)
plt.axis('equal');
This is essentially the goal of a manifold learning estimator: given high-dimensional embedded data, it seeks a low-dimensional representation of the data that preserves certain relationships within the data
The only clear advantage of manifold learning methods over PCA is their ability to preserve nonlinear relationships in the data; for that reason I tend to explore data with manifold methods only after first exploring them with PCA.
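As a quick taste of one such nonlinear method, here is an Isomap sketch on a synthetic S-curve; the dataset and parameter choices are illustrative only:

```python
from sklearn.datasets import make_s_curve
from sklearn.manifold import Isomap

# a 3D S-shaped sheet: intrinsically two-dimensional, like the paper analogy
X, t = make_s_curve(n_samples=500, random_state=0)

iso = Isomap(n_components=2, n_neighbors=10)
X2 = iso.fit_transform(X)
print(X.shape, '->', X2.shape)
```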
Clustering algorithms seek to learn, from the properties of the data, an optimal division or discrete labeling of groups of points.
We begin with k-means clustering, which is implemented in sklearn.cluster.KMeans.
The k-means algorithm searches for a pre-determined number of clusters within an unlabeled multidimensional dataset. It accomplishes this using a simple conception of what the optimal clustering looks like: the cluster center is the arithmetic mean of all the points belonging to the cluster, and each point is closer to its own cluster center than to other cluster centers.
# First, let's generate a two-dimensional dataset containing four distinct blobs.
from sklearn.datasets import make_blobs
X, y_true = make_blobs(n_samples=300, centers=4,
cluster_std=0.60, random_state=0)
plt.scatter(X[:, 0], X[:, 1], s=50);
# The k-means algorithm does this automatically,
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters = 4)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
#Let's visualize the results by plotting the data colored by these labels and the cluster centers
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5);
The k-means solution can be found with an intuitive iterative approach known as expectation–maximization (E-M), which here consists of the following procedure:
1. Guess some cluster centers
2. Repeat until converged:
   - E-step: assign each point to its nearest cluster center
   - M-step: set each cluster center to the mean of the points assigned to it
Each repetition of the E-step and M-step will always result in a better estimate of the cluster characteristics. Although the E-M approach is not guaranteed to converge to the global optimum, scikit-learn by default repeats the algorithm from several random initializations (the n_init parameter) and keeps the best result.
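Stripped of the plotting code below, the E-M loop itself is only a few lines. A minimal sketch, assuming four clusters, data points as initial centers, and a fixed iteration count rather than a convergence test:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances_argmin

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
rng = np.random.RandomState(42)
centers = X[rng.permutation(X.shape[0])[:4]]  # start from 4 random data points

for _ in range(10):
    # E-step: assign each point to the nearest center
    labels = pairwise_distances_argmin(X, centers)
    # M-step: move each center to the mean of its assigned points
    centers = np.array([X[labels == k].mean(axis=0) for k in range(4)])
print(centers.round(2))
```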
from sklearn.datasets import make_blobs  # samples_generator was removed in newer scikit-learn
from sklearn.metrics import pairwise_distances_argmin
X, y_true = make_blobs(n_samples=300, centers=4,
cluster_std=0.60, random_state=0)
rng = np.random.RandomState(42)
centers = [0, 4] + rng.randn(4, 2)
def draw_points(ax, c, factor=1):
ax.scatter(X[:, 0], X[:, 1], c=c, cmap='viridis',
s=50 * factor, alpha=0.3)
def draw_centers(ax, centers, factor=1, alpha=1.0):
ax.scatter(centers[:, 0], centers[:, 1],
c=np.arange(4), cmap='viridis', s=200 * factor,
alpha=alpha)
ax.scatter(centers[:, 0], centers[:, 1],
c='black', s=50 * factor, alpha=alpha)
def make_ax(fig, gs):
ax = fig.add_subplot(gs)
ax.xaxis.set_major_formatter(plt.NullFormatter())
ax.yaxis.set_major_formatter(plt.NullFormatter())
return ax
fig = plt.figure(figsize=(15, 4))
gs = plt.GridSpec(4, 15, left=0.02, right=0.98, bottom=0.05, top=0.95, wspace=0.2, hspace=0.2)
ax0 = make_ax(fig, gs[:4, :4])
ax0.text(0.98, 0.98, "Random Initialization", transform=ax0.transAxes,
ha='right', va='top', size=16)
draw_points(ax0, 'gray', factor=2)
draw_centers(ax0, centers, factor=2)
for i in range(3):
ax1 = make_ax(fig, gs[:2, 4 + 2 * i:6 + 2 * i])
ax2 = make_ax(fig, gs[2:, 5 + 2 * i:7 + 2 * i])
# E-step
y_pred = pairwise_distances_argmin(X, centers)
draw_points(ax1, y_pred)
draw_centers(ax1, centers)
# M-step
new_centers = np.array([X[y_pred == i].mean(0) for i in range(4)])
draw_points(ax2, y_pred)
draw_centers(ax2, centers, alpha=0.3)
draw_centers(ax2, new_centers)
for i in range(4):
ax2.annotate('', new_centers[i], centers[i],
arrowprops=dict(arrowstyle='->', linewidth=1))
# Finish iteration
centers = new_centers
ax1.text(0.95, 0.95, "E-Step", transform=ax1.transAxes, ha='right', va='top', size=14)
ax2.text(0.95, 0.95, "M-Step", transform=ax2.transAxes, ha='right', va='top', size=14)
# Final E-step
y_pred = pairwise_distances_argmin(X, centers)
axf = make_ax(fig, gs[:4, -4:])
draw_points(axf, y_pred, factor=2)
draw_centers(axf, centers, factor=2)
axf.text(0.98, 0.98, "Final Clustering", transform=axf.transAxes,
ha='right', va='top', size=16)
Choosing the number of clusters is sometimes difficult to answer without prior knowledge of the subject.
Alternatively, you might use a more complicated clustering algorithm which has a better quantitative measure of the fitness per number of clusters (e.g., Gaussian mixture models) or which can choose a suitable number of clusters (e.g., DBSCAN, mean-shift, or affinity propagation, all available in the sklearn.cluster submodule)
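One simple quantitative check that stays within plain k-means is the silhouette score from sklearn.metrics, which compares each point's distance to its own cluster against its distance to the next-nearest cluster. A sketch on the blob data (the range of k values tried is arbitrary):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# higher silhouette = tighter, better-separated clusters
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
print(scores)
```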
The fundamental model assumptions of k-means (points will be closer to their own cluster center than to others) means that the algorithm will often be ineffective if the clusters have complicated geometries.
from sklearn.datasets import make_moons
X, y = make_moons(200, noise=.05, random_state=0)
labels = KMeans(2, random_state=0).fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels,
s=50, cmap='viridis');
This situation is reminiscent of the discussion in In-Depth: Support Vector Machines, where we used a kernel transformation to project the data into a higher dimension where a linear separation is possible. We might imagine using the same trick to allow k-means to discover non-linear boundaries.
One version of this kernelized k-means is implemented in Scikit-Learn within the SpectralClustering estimator. It uses the graph of nearest neighbors to compute a higher-dimensional representation of the data, and then assigns labels using a k-means algorithm:
from sklearn.cluster import SpectralClustering
model = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
assign_labels='kmeans')
labels = model.fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels,
s=50, cmap='viridis');
Because each iteration of k-means must access every point in the dataset, the algorithm can be relatively slow as the number of samples grows
Alternatively we can use a subset of the data to update the cluster centers at each step. This is the idea behind batch-based k-means algorithms, one form of which is implemented in sklearn.cluster.MiniBatchKMeans. The interface for this is the same as for standard KMeans; we will see an example of its use as we continue our discussion.
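A minimal sketch of MiniBatchKMeans (the dataset size and batch_size are chosen arbitrarily):

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10000, centers=4, cluster_std=0.60, random_state=0)

# same interface as KMeans, but each update uses only a small random batch
mbk = MiniBatchKMeans(n_clusters=4, batch_size=100, n_init=10, random_state=0)
labels = mbk.fit_predict(X)
print(mbk.cluster_centers_.shape)
```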
Let's attempt to use k-means to identify similar digits without using the original label information.
# load the digits data
from sklearn.datasets import load_digits
digits = load_digits()
digits.data.shape
from sklearn.cluster import KMeans
# we know there are 10 digit classes
kmeans = KMeans(n_clusters=10, random_state=0)
clusters = kmeans.fit_predict(digits.data)
kmeans.cluster_centers_.shape
# Let's see what these cluster centers look like:
fig, ax = plt.subplots(2, 5, figsize=(8, 3))
centers = kmeans.cluster_centers_.reshape(10, 8, 8)
for axi, center in zip(ax.flat, centers):
axi.set(xticks=[], yticks=[])
axi.imshow(center, interpolation='nearest', cmap=plt.cm.binary)
We see that even without the labels, KMeans is able to find clusters whose centers are recognizable digits, with perhaps the exception of 1 and 8.
# check how accurate our unsupervised clustering was in finding similar digits within the data:
# by matching each learned cluster label with the true labels found in them:
from scipy.stats import mode
labels = np.zeros_like(clusters)
for i in range(10):
mask = (clusters == i)
labels[mask] = mode(digits.target[mask])[0]
from sklearn.metrics import accuracy_score
accuracy_score(digits.target, labels)
# Let's check the confusion matrix for this:
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(digits.target, labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
xticklabels=digits.target_names,
yticklabels=digits.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label');
# We see that the main point of confusion is between the eights and ones.
# let's try to push this even farther. We can use the t-distributed stochastic neighbor embedding (t-SNE) algorithm (mentioned in In-Depth: Manifold Learning) to pre-process the data before performing k-means.
from sklearn.manifold import TSNE
# Project the data: this step will take several seconds
tsne = TSNE(n_components=2, init='random', random_state=0)
digits_proj = tsne.fit_transform(digits.data)
# Compute the clusters
kmeans = KMeans(n_clusters=10, random_state=0)
clusters = kmeans.fit_predict(digits_proj)
# Permute the labels
labels = np.zeros_like(clusters)
for i in range(10):
mask = (clusters == i)
labels[mask] = mode(digits.target[mask])[0]
# Compute the accuracy
accuracy_score(digits.target, labels)
# That's nearly 92% classification accuracy without using the labels.
Two weaknesses of k-means motivate what follows: its use of simple distance from the cluster center to assign membership can lead to poor performance, and it has no intrinsic measure of the probability or uncertainty of cluster assignments. It is well suited to circular blobs of data but not to oblong or elliptical clusters. These two disadvantages of k-means, its lack of flexibility in cluster shape and its lack of probabilistic cluster assignment, suggest the Gaussian mixture model.
A Gaussian mixture model (GMM), which can also be used to generate new data, attempts to find a mixture of multi-dimensional Gaussian probability distributions that best model any input dataset.
# standard import
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns ; sns.set()
#Load sample data
from sklearn.datasets import make_blobs
X, y_true = make_blobs(n_samples=400, centers=4,
cluster_std=0.60, random_state=0)
X = X[:, ::-1] # flip axes for better plotting
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=4).fit(X)
#plot
labels = gmm.predict(X)
plt.scatter(X[:,0] , X[:,1], c=labels,s = 40, cmap='tab20b_r');
Because a GMM contains a probabilistic model under the hood, it is also possible to find probabilistic cluster assignments
# return probability of assign to cluster
probs = gmm.predict_proba(X)
probs[:5].round(2)
# we can use the size of each point to reflect the certainty of its assignment (smaller = less certain)
size = 50 * probs.max(1)**2 # probs.max(1) takes each point's highest cluster probability
plt.scatter(X[:,0],X[:,1], s = size , cmap='tab20b_r', c= labels);
Under the hood, a Gaussian mixture model is very similar to k-means: it uses an expectation–maximization approach which qualitatively does the following:
1. Choose starting guesses for the location and shape
2. Repeat until converged:
   - E-step: for each point, find weights encoding the probability of membership in each cluster
   - M-step: for each cluster, update its location, normalization, and shape based on all data points, making use of the weights
The result of this is that each cluster is associated not with a hard-edged sphere, but with a smooth Gaussian model
# create a function that visualize the locations and shapes of the GMM clusters by drawing ellipses based on the GMM output
from matplotlib.patches import Ellipse
def draw_ellipse(position, covariance, ax=None, **kwargs):
"""Draw an ellipse with a given position and covariance"""
ax = ax or plt.gca()
# Convert covariance to principal axes
if covariance.shape == (2, 2):
U, s, Vt = np.linalg.svd(covariance)
angle = np.degrees(np.arctan2(U[1, 0], U[0, 0]))
width, height = 2 * np.sqrt(s)
else:
angle = 0
width, height = 2 * np.sqrt(covariance)
# Draw the Ellipse
for nsig in range(1, 4):
ax.add_patch(Ellipse(position, nsig * width, nsig * height,
angle=angle, **kwargs))  # angle is keyword-only in recent matplotlib
def plot_gmm(gmm, X, label=True, ax=None):
ax = ax or plt.gca()
labels = gmm.fit(X).predict(X)
if label:
ax.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis', zorder=2)
else:
ax.scatter(X[:, 0], X[:, 1], s=40, zorder=2)
ax.axis('equal')
w_factor = 0.2 / gmm.weights_.max()
for pos, covar, w in zip(gmm.means_, gmm.covariances_, gmm.weights_):
draw_ellipse(pos, covar, alpha=w * w_factor)
# see the shape of cluster
gmm = GaussianMixture(n_components=4)
plot_gmm(gmm, X)
# try with elliptical data
rng = np.random.RandomState(13)
X_stretched = np.dot(X, rng.randn(2, 2))
gmm = GaussianMixture(n_components=4, covariance_type='full', random_state=42)
plot_gmm(gmm, X_stretched)
The covariance_type hyperparameter controls the degrees of freedom in the shape of each cluster. From less complex to more complex (and computationally expensive): 'spherical' constrains every cluster to a sphere, 'diag' allows independent axis-aligned widths, 'tied' makes all clusters share a single full covariance matrix, and 'full' allows each cluster to be an arbitrarily oriented ellipse.
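The constraint imposed by each covariance_type shows up directly in the shape of the fitted covariances_ attribute. A small sketch (the dataset and seed choices are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.60, random_state=0)

# record the shape of covariances_ under each constraint
shapes = {}
for ctype in ['spherical', 'diag', 'tied', 'full']:
    gmm = GaussianMixture(n_components=4, covariance_type=ctype,
                          random_state=0).fit(X)
    shapes[ctype] = np.shape(gmm.covariances_)
print(shapes)
```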
The result of a GMM fit to some data is technically not a clustering model, but a generative probabilistic model describing the distribution of the data.
from sklearn.datasets import make_moons
Xmoon , ymoon = make_moons(n_samples = 400,noise = 0.05)
plt.scatter(Xmoon[:,0],Xmoon[:,1], s = 20);
# Not useful as a clustering
gmm2 = GaussianMixture(n_components=2, covariance_type='full').fit(Xmoon)
labels = gmm2.predict(Xmoon)
plot_gmm(gmm2,Xmoon)
# we can use GMM as a generative model
gmm16 = GaussianMixture(n_components=16, covariance_type='full', random_state=0)
plot_gmm(gmm16, Xmoon, label=False)
Here the mixture of 16 Gaussians serves not to find separated clusters of data, but rather to model the overall distribution of the input data. The GMM gives us a recipe to generate new random data distributed similarly to our input.
For example, here are 400 new points drawn from this 16-component GMM fit to our original data:
# draw 400 new points from the fitted 16-component model
Xnew, _ = gmm16.sample(400)
plt.scatter(Xnew[:, 0], Xnew[:, 1]);
To choose the number of components, we can evaluate the likelihood of the data under the model, using cross-validation to avoid over-fitting.
Another means of correcting for over-fitting is to use an analytic criterion such as the Akaike information criterion (AIC) or the Bayesian information criterion (BIC).
# let's look at the AIC and BIC for the make_moons dataset
n_components = np.arange(1, 21)
# create the models with a list comprehension, looping over the candidate values of n
models = [GaussianMixture(n_components=n, covariance_type='full', random_state=0).fit(Xmoon) for n in n_components]
#plot
plt.plot(n_components, [m.bic(Xmoon) for m in models], label='BIC')
plt.plot(n_components, [m.aic(Xmoon) for m in models], label='AIC')
plt.legend(loc='best')
plt.xlabel('n_components');
The optimal number of components is the value that minimizes the AIC or BIC; here both suggest around 10 components.
Notice the important point: this choice of number of components measures how well GMM works as a density estimator, not how well it works as a clustering
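Because the fitted GMM is a density model, we can query the (log) density at arbitrary points via score_samples. A minimal sketch on the moon data, regenerated here so the snippet stands alone:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.mixture import GaussianMixture

Xmoon, _ = make_moons(n_samples=400, noise=0.05, random_state=0)
gmm16 = GaussianMixture(n_components=16, covariance_type='full',
                        random_state=0).fit(Xmoon)

# log-density of the fitted model, evaluated at the training points
logdens = gmm16.score_samples(Xmoon)
print(logdens.shape)
```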
Using a GMM to generate new handwritten digits
#import data
from sklearn.datasets import load_digits
digits = load_digits()
digits.data.shape
# create function to plot digit
def plot_digits(data):
fig, ax = plt.subplots(10, 10, figsize=(8, 8),
subplot_kw=dict(xticks=[], yticks=[]))
fig.subplots_adjust(hspace=0.05, wspace=0.05)
for i, axi in enumerate(ax.flat):
im = axi.imshow(data[i].reshape(8, 8), cmap='binary')
im.set_clim(0, 16)
plot_digits(digits.data)
GMMs can have difficulty converging in such a high dimensional space, so we will start with an invertible dimensionality reduction algorithm on the data. Here we will use a straightforward PCA, asking it to preserve 99% of the variance in the projected data:
# PCA before GMM
from sklearn.decomposition import PCA
pca = PCA(0.99, whiten=True)
data = pca.fit_transform(digits.data)
#plot
data.shape
The projection reduces the data from 64 to 41 dimensions with almost no information lost.
Let's use the AIC to choose how many GMM components to use:
n_components = np.arange(50, 210, 10)
models = [GaussianMixture(n, covariance_type='full')
for n in n_components]
aics = [model.fit(data).aic(data) for model in models]
plt.plot(n_components, aics);
It appears that around 140 components minimizes the AIC.
Let's fit a GMM with 140 components:
gmm = GaussianMixture(n_components=140)
gmm.fit(data)
print(gmm.converged_)
data_new = gmm.sample(100)[0]
data_new.shape
Finally, we can use the inverse transform of the PCA object to construct the new digits:
digits_new = pca.inverse_transform(data_new)
plot_digits(digits_new)
GMM is something of a hybrid between a clustering estimator and a density estimator.
Recall that a density estimator is an algorithm which takes a D-dimensional dataset and produces an estimate of the D-dimensional probability distribution which that data is drawn from. The GMM algorithm accomplishes this by representing the density as a weighted sum of Gaussian distributions.
Kernel density estimation uses a mixture consisting of one Gaussian component per point, resulting in an essentially non-parametric estimator of density
The free parameters of kernel density estimation are the kernel, which specifies the shape of the distribution placed at each point, and the kernel bandwidth, which controls the size of the kernel at each point
# create some data that is drawn from two normal distributions:
def make_data(N, f=0.3, rseed=1):
rand = np.random.RandomState(rseed)
x = rand.randn(N)
x[int(f * N):] += 5
return x
x = make_data(1000)
x_d = np.linspace(-4, 8, 1000)
from sklearn.neighbors import KernelDensity
# instantiate and fit the KDE model
kde = KernelDensity(bandwidth=0.35, kernel='gaussian')
kde.fit(x[:, None])
# score_samples returns the log of the probability density
logprob = kde.score_samples(x_d[:, None])
plt.fill_between(x_d, np.exp(logprob), alpha=0.5)
plt.plot(x, np.full_like(x, -0.01), '|k', markeredgewidth=1)
plt.ylim(-0.02, 0.22)
Here we will use GridSearchCV to optimize the bandwidth for the preceding dataset. Because we are looking at such a small dataset, we will use leave-one-out cross-validation, which minimizes the reduction in training set size for each cross-validation trial:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import LeaveOneOut
bandwidths = 10 ** np.linspace(-1, 1, 100)
grid = GridSearchCV(KernelDensity(kernel='gaussian'),
{'bandwidth': bandwidths},
cv=LeaveOneOut())
grid.fit(x[:, None]);
# the choice of bandwidth which maximizes the score (which in this case defaults to the log-likelihood)
grid.best_params_
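After the search, best_estimator_ already holds a KernelDensity refit with the winning bandwidth, so it can be used directly for evaluation. A self-contained sketch on a fresh synthetic sample (kept small so leave-one-out stays cheap):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.neighbors import KernelDensity

rand = np.random.RandomState(1)
x = rand.randn(100)  # a small sample keeps 100-fold leave-one-out fast

bandwidths = 10 ** np.linspace(-1, 1, 20)
grid = GridSearchCV(KernelDensity(kernel='gaussian'),
                    {'bandwidth': bandwidths},
                    cv=LeaveOneOut())
grid.fit(x[:, None])

# best_estimator_ is a KernelDensity already refit with the winning bandwidth
kde = grid.best_estimator_
x_d = np.linspace(-4, 4, 200)
density = np.exp(kde.score_samples(x_d[:, None]))
print(grid.best_params_, density.shape)
```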